The validation of predictor weights, derived in one
sample, by computing the correlation of the weighted sum
of the predictors with the criterion in new samples is
called cross-validation. The technique applies to any
method of calculating the predictor weights. In this
study three prediction methods were compared by
cross-validation: multiple regression on the predictors, on the
principal components of the predictors, and on the principal
predictors. Prediction from the principal predictors
is only possible when there are several criterion variables.

In order to discover the parameters of the multivariate
distribution which affect the choice of prediction method
and the number of principal components or principal
predictors to include in the regression, a large number of
distributions were simulated on a computer and samples
generated from these distributions. The population
distributions varied in the following parameters: n, the
number of predictors; m, the number of criteria; $\rho^2$, the
squared multiple correlation in the case of one criterion
or the average squared multiple correlation of m criteria
when m > 1; and $\pi^2$, the average predictor variance related
to the criteria.



A typical calculation consisted of the following
steps: generation of a population distribution for a set
of values of the parameters; generation of two samples of
size N from this population; calculation, in one sample,
of the predictor weights for one or more prediction methods;
and validation of these weights in the second sample.
A large number of populations were generated, varying in
the values of the parameters.

In cross-validating one criterion variable, it was
shown that the optimal number of principal components to
include in the regression is a function of n, $\rho^2$, $\pi^2$,
and N. For several criterion variables, the relative
effectiveness of prediction from the principal components
of the predictors and from the principal predictors depends
on $\pi^2$ and on the order of dependence of the predictors
on the principal predictors.

The simulation calculations were compared with
calculations in real samples; a close correspondence between
real and simulated data was found. This comparison and
other calculations with the simulated distributions showed
that the simulation was accurate.
CHAPTER 1

INTRODUCTION 

1.1.  Multiple  Regression  and  Cross-Validation 

A  problem  common  to  many  areas  of  psychology  is  the 
prediction  of  a  person’s  score  on  one  variable  from  his 
scores  on  a  number  of  other  variables.  The  variable  that 
is  to  be  predicted  is  called  the  criterion  and  the  other 
variables  are  called  predictors.  Many  methods  have  been 
developed  to  combine  predictor  scores  in  order  to  optimize 
the  prediction  of  the  criterion.  A  common  procedure  is  to 
obtain a sample of subjects with known predictor and
criterion scores (the derivation sample) and to calculate
the linear combination of the predictor scores that best
predicts the criterion scores. By "best" is usually meant
"least squared error", which means that the sum (over
subjects) of the squared deviations of the observed from
the predicted criterion score is a minimum. The optimizing
coefficients of the predictor scores are called the multiple
regression weights and are calculated from the normal
equations which express the minimization conditions
(Anderson, 1958; Kendall and Stuart, 1961).
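
The normal equations can be illustrated with a minimal numerical sketch. The function name `regression_weights`, the simulated derivation sample, and the use of NumPy are assumptions made for illustration only; the scores are centered so that no constant term is needed.

```python
import numpy as np

def regression_weights(X, y):
    """Solve the normal equations (X'X) b = X'y for the least squares weights b."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical derivation sample: N subjects, n predictors (illustrative data).
rng = np.random.default_rng(0)
N, n = 50, 3
X = rng.normal(size=(N, n))
y = X @ np.array([0.5, 0.3, 0.0]) + rng.normal(size=N)
X = X - X.mean(axis=0)   # center the predictor scores
y = y - y.mean()         # center the criterion scores

b = regression_weights(X, y)
residual = y - X @ b

# The minimization conditions state that the errors of prediction are
# uncorrelated with every predictor in the derivation sample.
print(np.round(X.T @ residual, 10))
```

Solving the linear system directly, rather than inverting X'X, is a common numerical choice and is not taken from the text.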

When multiple regression is used to compute predictor
weights, a multiple correlation may be calculated. The
multiple correlation is the Pearson product-moment
correlation, in the sample, between the optimal linear combination
of the predictors and the criterion variable. The multiple
correlation is thus a measure of the degree of relationship
between the predictors and the criterion. However, the
sample multiple correlation is a biased estimate of this
relationship and is generally larger than the true population
multiple correlation. The bias occurs because the process
of minimizing the average squared error in prediction
is equivalent to maximizing the correlation between the
linear combination of the predictors and the criterion.
Due to the finite size of the sample, the optimizing linear
combination will be fitted to the idiosyncrasies of the
sample and will generally result in a higher multiple
correlation than the population multiple correlation.

One problem in the application of multiple correlation
techniques is therefore the estimation of the true multiple
correlation from the biased sample multiple correlation.
In the next section it will be shown that there are two
population correlations which must be distinguished. A
number of formulas for correcting the sample multiple
correlation are known. However, these formulas require
assumptions which are often difficult to satisfy, and
therefore many early investigators estimated the population
correlation by applying to a second sample the regression
weights calculated in an original sample. They found that
the correlation between the regression function and the
criterion in the second sample was less than the original
sample multiple correlation. This technique became known
as cross-validation of the predictor weights or simply as
cross-validation (Mosier, 1951). The correlation in the
second sample is called the cross-validity. The first
sample is known as the derivation sample, the second is
the validation sample. An obvious addition to the
cross-validation method is to repeat the calculations
interchanging the roles of the first and second sample. We
shall call this technique double cross-validation.
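
Double cross-validation as just described can be sketched as follows. The helper names (`weights`, `validity`, `double_cross_validation`, `draw`) and the simulated samples are hypothetical and serve only to show the bookkeeping; raw scores are used here for simplicity.

```python
import numpy as np

def weights(X, y):
    """Least squares weights from centered scores (normal equations)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

def validity(b, X, y):
    """Pearson correlation between the weighted sum X b and the criterion y."""
    return np.corrcoef(X @ b, y)[0, 1]

def double_cross_validation(X1, y1, X2, y2):
    b1, b2 = weights(X1, y1), weights(X2, y2)
    return {
        "r_sample_1": validity(b1, X1, y1),   # multiple correlation, sample 1
        "r_sample_2": validity(b2, X2, y2),   # multiple correlation, sample 2
        "rc_1_on_2": validity(b1, X2, y2),    # weights from 1 validated in 2
        "rc_2_on_1": validity(b2, X1, y1),    # weights from 2 validated in 1
    }

rng = np.random.default_rng(1)

def draw(N, n=8):
    """Illustrative sample: only the first two predictors carry signal."""
    X = rng.normal(size=(N, n))
    y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=N)
    return X, y

X1, y1 = draw(40)
X2, y2 = draw(40)
result = double_cross_validation(X1, y1, X2, y2)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Within any one sample, the weights derived in that sample give the largest attainable correlation, so each multiple correlation is at least as large as the cross-validity of the other sample's weights in the same sample.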

This  study  was  designed  to  investigate: 

(a)  the  accuracy  of  the  cross-validity  as  an  estimate  of 
the  population  correlation, 

(b)  the  effectiveness  of  two  reduced  rank  methods  for 
estimating  predictor  weights,  and 

(c)  the  effect  of  the  variation  of  some  parameters  of  the 
population  distribution  on  the  results  of  (a)  and  (b). 

The estimation of the population correlation is described
in more detail in Section 1.2. The reduced rank methods
are introduced in Section 1.3. Finally, the study of the
effect of variation of population parameters by a simulation
technique is introduced in Section 1.4.

1.2.  Estimates  of  Validity 

Let the predictor variables be $x_1, x_2, \ldots, x_n$ and
let the criterion variable be y. Then the regression function
in the population is

(1.2.1)  $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$.

The constant term in the equation is $\beta_0$; the $\beta_i$ are called
the regression weights. Two models for the predictors are
possible, the regression model and the correlation model
(Ezekiel and Fox, 1959, pp. 279-281). In the regression
model, the values of the predictor variables are fixed and
only the criterion is a random variable. A more realistic
model for most multivariate work in psychology is to assume
that both the predictors and the criterion are random
variables (the correlation model). Under the null hypothesis
of zero multiple correlation the distributional theory is
identical for the two models. However, when the null
hypothesis is not true the distributions are different under
the two models. Since the distributional theory is much
more complicated under the correlation model, most
investigators in psychology (e.g. Burket, 1964) have continued
to use the regression model, hoping that there will be
little practical difference between the two models.

Regression equations can also differ in whether the
constant term, $\beta_0$, is included. In the case of the
regression model, the constant term is really indistinguishable
from the other terms in the equation since a predictor
variable, $x_0$, may be defined as the constant 1.0. Then
the constant term may be written as $\beta_0 x_0$. Therefore,
formulas developed for the constant = 0 case

(1.2.2)  $\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$

may be modified for the constant $\neq$ 0 case (1.2.1) by
simply replacing n by n + 1.
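
The device of treating the constant term as the weight of a predictor $x_0 = 1.0$ can be checked numerically. This is a small sketch on made-up data; the variable names are illustrative and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 30, 2
X = rng.normal(loc=5.0, size=(N, n))                 # predictors with non-zero means
y = 2.0 + X @ np.array([1.0, -0.5]) + rng.normal(size=N)

# Zero-constant formula applied to n + 1 predictors, the first being x_0 = 1.
X0 = np.column_stack([np.ones(N), X])
beta = np.linalg.solve(X0.T @ X0, X0.T @ y)          # beta[0] is the constant term

# Equivalent route: fit the centered scores, then recover the constant
# from the sample means.
Xc, yc = X - X.mean(axis=0), y - y.mean()
b = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
b0 = y.mean() - X.mean(axis=0) @ b

print(np.round(beta, 4), np.round(b0, 4), np.round(b, 4))
```

Both routes give the same weights, which is why the constant need not be treated separately when the predictor values are regarded as fixed.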

In the correlation model this simple correspondence
between the zero and non-zero constant cases does not hold
since $x_1, \ldots, x_n$ are random variables while $x_0$ is fixed.
Inclusion of a constant term does not affect the multiple
correlation or the correlation between the regression
function and any other variable. Thus in studies such as
the present one, which emphasize correlation measures, it
is simplest to set the constant term to zero. However, if
the mean squared error of prediction is used as a measure
of accuracy of prediction, it is very important to state
whether the constant term is included in the regression.

Let $\rho$ be the population multiple correlation, $\sigma_y$
the population standard deviation of y, and $\sigma_{y-\hat{y}}$ the
population standard deviation of the error in prediction
$(y - \hat{y})$, where $\hat{y}$ is the regression function (1.2.1) with
weights calculated from the normal equations. Then

(1.2.3)  $\rho^2 = 1 - \sigma_{y-\hat{y}}^2 / \sigma_y^2$.

A similar equation holds in the sample, relating the
squared sample correlation, $r^2$, to the mean squared error
of prediction, MSE, and the standard deviation of the
sample, $s_y$:

(1.2.4)  $r^2 = 1 - \mathrm{MSE} / s_y^2$.

The first estimation of $\rho^2$ in the psychological literature
(Larson, 1931) is by the following formula:

(1.2.5)  $\mathrm{Est}(\rho^2) = 1 - \frac{N}{N - n} (1 - r^2)$

where N is the sample size. Larson does not give a
derivation of this formula, but Wherry (1931) showed that
it follows (in the regression model) from estimating
$\sigma_y^2$ by $s_y^2$ and estimating $\sigma_{y-\hat{y}}^2$ by (MSE) N/(N - n).
The substitution of these two estimates into (1.2.3) and
the use of (1.2.4) gives (1.2.5). In order to improve
this estimate, Wherry (1931) estimated $\sigma_y^2$ by $s_y^2 N/(N - 1)$
rather than by $s_y^2$. The resulting formula is

(1.2.6)  $\mathrm{Est}(\rho^2) = 1 - \frac{N - 1}{N - n} (1 - r^2)$.

Larson and Wherry compared their estimates with
cross-validities and Wherry showed that (1.2.6) is superior
to (1.2.5).

It is not entirely clear how Larson and Wherry
handled the constant term in the regression function.
Formula (1.2.6) is strictly applicable to a zero constant
term. When the constant term is not zero, the unbiased
estimate of $\sigma_{y-\hat{y}}^2$ is (MSE) N/(N - n - 1) so that the
estimate of $\rho^2$ is

(1.2.7)  $\mathrm{Est}(\rho^2) = 1 - \frac{N - 1}{N - n - 1} (1 - r^2)$.

Formula (1.2.7) is often referred to as Wherry's formula
even though his original formula was (1.2.6). Formula
(1.2.7) is not an unbiased estimate of $\rho^2$ since the ratio
of two unbiased estimates is not unbiased. However, unbiased
estimates of $\rho^2$ are not always desirable, for, if the true
$\rho^2 = 0$, an unbiased estimate must take on both negative
and positive values even though a squared multiple
correlation is never negative.
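
Formulas (1.2.5) to (1.2.7) are simple enough to evaluate directly. The sketch below uses hypothetical function names and illustrative input values (r^2 = 0.40, N = 50, n = 5) that are not taken from the study.

```python
def larson(r2, N, n):
    """(1.2.5): Est(rho^2) = 1 - N/(N - n) * (1 - r^2)."""
    return 1.0 - N / (N - n) * (1.0 - r2)

def wherry_original(r2, N, n):
    """(1.2.6): Est(rho^2) = 1 - (N - 1)/(N - n) * (1 - r^2)."""
    return 1.0 - (N - 1) / (N - n) * (1.0 - r2)

def wherry_with_constant(r2, N, n):
    """(1.2.7): Est(rho^2) = 1 - (N - 1)/(N - n - 1) * (1 - r^2)."""
    return 1.0 - (N - 1) / (N - n - 1) * (1.0 - r2)

r2, N, n = 0.40, 50, 5   # illustrative values only
for f in (larson, wherry_original, wherry_with_constant):
    print(f.__name__, round(f(r2, N, n), 4))

# With a small sample r^2 the corrected estimate can fall below zero,
# which is the defect of such estimates noted in the text.
print(round(wherry_with_constant(0.05, N, n), 4))
```

All three corrections shrink the sample value, and they differ only in the ratio multiplying (1 - r^2).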

The multiple correlation, $\rho$, is the correlation, in
the population, of the criterion and the regression function
calculated in the population. In applications, the
population regression function can never be known and one is
more interested in how effective the sample regression
function is in other samples. A measure of this
effectiveness is $r_c$, the sample cross-validity. For any given
regression function, $r_c$ will vary from validation sample
to validation sample. The average value of $r_c$ will be
approximately equal to the correlation, in the population, of
the sample regression function with the criterion. This
correlation is the population cross-validity, $\rho_c$. Wherry's
formula estimates $\rho$ rather than $\rho_c$. Lord (1950) and
Nicholson (1960) derived an unbiased estimate of the population
mean square error of a sample regression function. Using
this estimate of MSE, an estimate of $\rho_c^2$ is

(1.2.8)  $\mathrm{Est}(\rho_c^2) = 1 - \frac{N - 1}{N - n - 1} \cdot \frac{N + n + 1}{N} (1 - r^2)$.

This formula applies to the regression model with a constant
term. Darlington (1967) modified this formula for the
correlation model with a constant term. His formula is

(1.2.9)  $\mathrm{Est}(\rho_c^2) = 1 - \frac{N - 1}{N - n - 1} \cdot \frac{N - 2}{N - n - 2} \cdot \frac{N + 1}{N} (1 - r^2)$.

This formula is based on the assumption that the predictors
and criterion have a multivariate normal distribution.
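
As a rough numerical illustration of (1.2.8) and (1.2.9), the following sketch evaluates both estimates for illustrative values of r^2, N, and n. The function names are hypothetical labels for the two formulas.

```python
def lord_nicholson(r2, N, n):
    """(1.2.8): regression model with a constant term."""
    return 1.0 - (N - 1) / (N - n - 1) * (N + n + 1) / N * (1.0 - r2)

def darlington(r2, N, n):
    """(1.2.9): correlation model with a constant term."""
    return (1.0 - (N - 1) / (N - n - 1) * (N - 2) / (N - n - 2)
            * (N + 1) / N * (1.0 - r2))

r2, N = 0.40, 50   # illustrative values only
for n in (2, 5, 10):
    print(n, round(lord_nicholson(r2, N, n), 4), round(darlington(r2, N, n), 4))
```

For a fixed sample r^2, both estimates of the squared cross-validity fall as the number of predictors n grows, and (1.2.9) is always the smaller of the two.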

It would be possible to derive a similar formula for
$\mathrm{Est}(\rho_c^2)$ in the multivariate normal case with no constant
term. It is not, however, the purpose of this study to
derive the best estimate of $\rho_c^2$ for it is not clear what
properties such an estimator should have, particularly
since an unbiased estimator has the defects mentioned
above. It is more interesting to study the accuracy of
the cross-validity as an estimate of $\rho_c$ and $\rho$.

Returning to estimating $\rho$, Wishart (1931) calculated
the moments of the distribution of $r^2$ for the multivariate
normal distribution. The expected value of $r^2$ is

(1.2.10)  $E(r^2) = 1 - \frac{N - n - 1}{N - 1} (1 - \rho^2) \, F(1, 1, (N + 1)/2, \rho^2)$

where F(a, b, c, x) is the hypergeometric function.
Using the first two terms of the expansion of this function,
equation (1.2.10) reduces to

(1.2.11)  $E(r^2) = 1 - \frac{N - n - 1}{N - 1} (1 - \rho^2) - \frac{N - n - 1}{N - 1} \cdot \frac{2}{N + 1} \, \rho^2 (1 - \rho^2)$.

Olkin and Pratt (1958) showed that an unbiased estimate
of $\rho^2$ is

(1.2.12)  $\mathrm{Est}(\rho^2) = 1 - \frac{N - 3}{N - n - 1} (1 - r^2) \, F(1, 1, (N - n + 1)/2, 1 - r^2)$,

which, neglecting terms in $1/N^2$, is

(1.2.13)  $\mathrm{Est}(\rho^2) = 1 - \frac{N - 3}{N - n - 1} (1 - r^2) - \frac{N - 3}{N - n - 1} \cdot \frac{2}{N - n + 1} (1 - r^2)^2$.

The Wherry estimate (1.2.7) is almost identical to the first
two terms of this series.
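
The hypergeometric function in (1.2.10) and (1.2.12) has the special form F(1, 1, c, x), whose power series is easy to sum directly. The following sketch, with hypothetical function names and a fixed number of series terms chosen for illustration, compares the full expressions with their two-term approximations (1.2.11) and (1.2.13).

```python
def hyp_1_1(c, x, terms=200):
    """Partial sum of F(1, 1; c; x) = sum_k k! x^k / (c (c+1) ... (c+k-1))."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (k + 1) * x / (c + k)   # ratio of successive series terms
    return total

def expected_r2(rho2, N, n):
    """(1.2.10): expected sample r^2 under multivariate normality."""
    return 1.0 - (N - n - 1) / (N - 1) * (1.0 - rho2) * hyp_1_1((N + 1) / 2, rho2)

def expected_r2_two_terms(rho2, N, n):
    """(1.2.11): first two terms of the expansion of (1.2.10)."""
    a = (N - n - 1) / (N - 1)
    return 1.0 - a * (1.0 - rho2) - a * 2.0 / (N + 1) * rho2 * (1.0 - rho2)

def olkin_pratt(r2, N, n):
    """(1.2.12): Olkin and Pratt estimate of rho^2."""
    return (1.0 - (N - 3) / (N - n - 1) * (1.0 - r2)
            * hyp_1_1((N - n + 1) / 2, 1.0 - r2))

def olkin_pratt_two_terms(r2, N, n):
    """(1.2.13): approximation to (1.2.12)."""
    a = (N - 3) / (N - n - 1)
    return 1.0 - a * (1.0 - r2) - a * 2.0 / (N - n + 1) * (1.0 - r2) ** 2

N, n = 50, 5   # illustrative values only
print(round(expected_r2(0.0, N, n), 4))    # equals n / (N - 1) when rho^2 = 0
print(round(expected_r2(0.3, N, n), 4), round(expected_r2_two_terms(0.3, N, n), 4))
print(round(olkin_pratt(0.4, N, n), 4), round(olkin_pratt_two_terms(0.4, N, n), 4))
```

The series converges quickly when its last argument is well below 1; for sample values of r^2 close to zero more terms, or a library routine, would be advisable.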

Darlington  (1967)  has  carefully  distinguished  the 
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four  correlations  p,  pc,  r,  and  rc.  The  smallest  of  these, 
p  and  r  ,  are  the  validity  of  the  sample  regression  function 
in  the  population  and  another  sample,  respectively.  The 
average,  over  many  samples.  Of  the  cross-validity,  r  ,  will 
be  approximately  equal  to  p  .  The  next  smallest  correlation 
is  p,  the  population  multiple  correlation  or  the  validity 
of  the  population  regression  function  in  the  population. 

The  largest  correlation  is  the  sample  multiple  correlation, 
r,  which  is  the  validity  of  the  sample  regression  function 
in  the  derivation  sample.  These  relationships  may  be  sum¬ 
marized  as  follows: 

(1.2.14)  E(rc)  *  pc  <  p  <  r  . 

Empirical  confirmation  of  (1.2.14)  is  presented  in 
Section  3-3. 

1.3.  Improvement  of  Prediction 

It is well known that adding predictors to a
regression equation increases both the sample and population
multiple correlations. However, the greater the number of
predictors, n, the more unstable are the sample regression
weights and the lower are the sample cross-validity, $r_c$,
and the population cross-validity, $\rho_c$. The decrease in
estimated $\rho_c$ follows from (1.2.9).

A second difficulty with a large number of predictors
in multiple regression is that a subset of them would
probably do just as well, if the subset could be
determined. For predictions in applied psychology, e.g.
personnel selection, it is undesirable to have to make a
large number of measurements on each individual in order
to make accurate predictions. Furthermore, the weights
for a subset of predictors would be more stable in future
samples due to the smaller n.

There  are  several  ways  to  select  a  subset  of  predictors. 
The  best  selection  procedure  is  stepwise  regression  in 
which  predictors  are  added  to  the  regression,  one  at  a 
time,  until  there  is  no  significant  additional  prediction. 
Other selection procedures are shown by Darlington (1967)
to be inferior to the stepwise method.

Another way to reduce the number of predictors in the
regression function is to use a few linear combinations
of the predictors rather than the predictors themselves.
Two such methods, called reduced rank methods, are
considered in this study. In the first method (Horst, 1941),
the largest principal components of the predictors (Anderson,
1958) are entered in the regression function. Since the
principal components may be expressed as linear combinations
of the predictors, the regression function may be
transformed to a linear combination of the predictors. Hence
the full set of predictor variables is used but only through
the intermediary of a few principal components. These
components may be interpreted psychologically and it may
be possible to select predictors loading highly on the
components as a subset to use in future prediction. In
this way reduced rank prediction can lead to a reduction of
the size of the predictor battery.
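
Regression on the largest principal components, transformed back to weights for the original predictors, might be sketched as follows. The function name `pc_regression_weights` and the simulated data are hypothetical, and the components are taken from the covariance matrix of the centered predictors as an assumption; standardizing the predictors first would give components of the correlation matrix instead.

```python
import numpy as np

def pc_regression_weights(X, y, t):
    """Weights for the original predictors from regressing y on the
    t largest principal components of X."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    cov = Xc.T @ Xc / len(Xc)
    eigval, eigvec = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:t]       # indices of the t largest
    V = eigvec[:, order]                       # n x t component weights
    Z = Xc @ V                                 # component scores
    g = np.linalg.solve(Z.T @ Z, Z.T @ yc)     # regression weights on components
    return V @ g                               # transformed to predictor weights

rng = np.random.default_rng(4)
N, n = 80, 5
common = rng.normal(size=(N, 1))
X = 0.7 * common + rng.normal(size=(N, n))     # correlated predictors
y = X.sum(axis=1) + rng.normal(size=N)

w_two = pc_regression_weights(X, y, 2)         # reduced rank weights
w_full = pc_regression_weights(X, y, n)        # all components retained
print(np.round(w_two, 3))
print(np.round(w_full, 3))
```

When all n components are retained, the weights coincide with ordinary multiple regression on the predictors; the reduction in rank comes only from dropping the smaller components.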
In his 1941 paper, Horst also suggested that the
predictors could be represented as a linear function of
common and unique factors rather than as a linear function
of the principal components. This factor analytic model
is more difficult to treat because of the difficulty of
estimating the factors as linear combinations of the
predictors. Unlike Horst's first method, factor analysis
is not a reduced rank method. The factor analytic model
for regression calculations was studied by Leiman (1951)
with some success but will not be considered further in
this study.

Before outlining the second reduced rank procedure,
let us consider a study by Burket (1964) comparing a number
of regression methods in a large data sample. He compared
two stepwise selection procedures (Efroymson, 1960; Horst
and MacEwan, 1960), the largest principal components method,
the smallest principal components method (Guttman, 1958)
and the criterion-related principal components method
(Hotelling, 1957; Massy, 1965). Guttman proposed the use
of the smallest principal components since the solution for
the multiple regression weights depends on the inverse
of the predictor intercorrelation matrix, and the largest
components of the inverse are the smallest components of
the original matrix. Hotelling and Massy suggested that
the principal components which are entered into the
regression function should be those components correlating
maximally with the criterion rather than those of largest
variance. Burket compared these five methods for several
criteria and in several subsamples of his total sample.
He found that the largest principal components method was
consistently superior to the other four methods. One purpose
of the present study is to show under what conditions this
superiority can be expected to hold.

The second reduced rank method, prediction from the
principal predictors, was developed from the following
considerations. The principal components of the predictors
may not be highly related to the criterion since the
components are determined solely from the intercorrelations
of the predictors. It would be desirable to find linear
combinations of the predictors which are strongly related
to the criterion. The Hotelling and Massy method employed
by Burket finds these linear combinations by computing
the correlation of each principal component with the
criterion and entering into the multiple regression only
those components with the highest correlations. However,
a more effective procedure might be to find those linear
combinations of the predictors (not necessarily the principal
components of the predictors) which are maximally
correlated with the criterion.

In the single criterion case, this problem is trivial
since there is only one linear combination of the predictors
maximally correlated with the criterion and all other
orthogonal combinations are uncorrelated with the criterion.
This combination is simply the regression function, i.e.
the predicted criterion, using multiple regression on all
the predictors. Therefore, in the single criterion case,
nothing new is found by considering linear combinations of
the predictors maximally correlated with the criterion.

Consider, however, prediction of several criteria
from a common set of predictors. Examples of such multiple
criteria are the prediction of success in several academic
curricula by using a battery of aptitude tests or the
prediction of a number of social criteria using scales
from a personality test (Hase and Goldberg, 1966). In
particular, let us suppose that we wish to predict each
of the criteria equally well. Then Tucker (1957) has
developed a method which discovers those linear combinations
of the predictors maximally related to the set of criteria.
These combinations are called the principal predictors.
The largest of the principal predictors may be entered
into the regression equations for each criterion. The
principal predictors have the property that, for a fixed
number of linear combinations entered into each regression,
the average squared multiple correlation is greater for
the principal predictors than for any other linear
combinations entered into the regression.

The principal predictors were developed by Tucker
as a convenient way to summarize a large number of
predictor scores by a few criterion-related predictor scores.
The principal predictors also provide a useful
conceptualization of the relationship of a set of predictors
to a set of criteria. In the present study, on the other
hand, the principal predictors are compared with the
principal components as reduced rank prediction methods.


In any prediction calculation, each criterion variable
may be divided into two parts: one part is predictable
from the set of predictors and the other part is
unpredictable from these predictors. When there are several
criteria the predictable parts of the criteria are
themselves a set of variables which have principal components.
These principal components are the principal predictors.
It is important not to confuse the principal components
of the predictors, previously discussed, with the
principal components of the predictable parts of the criteria,
which are called the principal predictors. The largest
principal predictor accounts for the largest portion of
the predictable variation in the criteria. The next
largest principal predictor accounts for the next largest
portion, and so on. Therefore a few of the principal
predictors account for most of the predictable variation
in the criteria.
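
The verbal definition above, principal components of the predictable parts of the criteria, can be turned into a short numerical sketch. This is only an illustration of that definition under stated assumptions (criteria standardized so that each counts equally, simulated data, NumPy); it is not claimed to reproduce Tucker's (1957) computing procedure, and the function name is hypothetical.

```python
import numpy as np

def principal_predictor_weights(X, Y, t):
    """Predictor weights (n x m) from regressing each criterion on the
    t largest principal predictors; also returns the full-rank weights."""
    Xc = X - X.mean(axis=0)
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)      # standardized criteria
    B = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yz)      # full regression weights
    Y_hat = Xc @ B                                 # predictable parts of criteria
    eigval, eigvec = np.linalg.eigh(Y_hat.T @ Y_hat / len(Y_hat))
    order = np.argsort(eigval)[::-1][:t]           # t largest components
    U = eigvec[:, order]                           # m x t
    P = Y_hat @ U                                  # principal predictor scores
    G = np.linalg.solve(P.T @ P, P.T @ Yz)         # criteria regressed on P
    return B @ U @ G, B

rng = np.random.default_rng(5)
N, n, m = 100, 6, 3
X = rng.normal(size=(N, n))
Y = X @ rng.normal(size=(n, m)) + rng.normal(size=(N, m))

W1, B = principal_predictor_weights(X, Y, 1)       # one principal predictor
Wm, _ = principal_predictor_weights(X, Y, m)       # all m principal predictors
print(np.linalg.matrix_rank(W1), np.allclose(Wm, B))
```

Since the principal predictor scores are X times the matrix B U, they are linear combinations of the original predictors, and retaining all m of them reproduces the ordinary regression weights for every criterion.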

The largest principal predictors may be used as
predictors themselves. Then the principal predictors, like
the principal components of the predictors, can be expressed
in terms of the original predictors. The weights for the
original predictors can therefore be calculated. Also,
the principal predictors may be interpreted psychologically,
which may lead to greater understanding of the
relationship between the predictors and the criteria. The principal
predictors, unlike the principal components of the
predictors, are criterion-related so that variation in the
predictors which is unrelated to the criteria is not represented
in the principal predictors. In some cases, the predictor
variation which is unrelated to the criteria could be large
enough to dominate the principal components of the
predictors. But this variation is not useful for prediction.
This is the reason that prediction from the largest principal
predictors may be superior to prediction from the largest
principal components.

The two methods, prediction from the principal
components of the predictors and from the principal predictors,
are called reduced rank methods since, in both cases, a
correlation matrix may be approximated by a matrix of
lower rank using the largest principal components or the
largest principal predictors. In the first method, the
correlation matrix of the predictors is approximated while
in the second method, the correlation matrix of the
predictable parts of the criteria is approximated. These
statements are made more precise in Chapter 2.

Of  the  prediction  methods  discussed  in  this  section 
only  three  will  be  considered  further  in  this  study: 

(a)  prediction  from  the  full  set  of  predictors, 

(b)  prediction  from  the  largest  principal  components  of 
the  predictors,  and 

(c)  prediction  from  the  largest  principal  predictors. 

The last method is possible only when there are several
criteria. The first two methods may be used for one or
several criteria.
1.4.  The  Comparison  of  Prediction  Methods

In  order  to  evaluate  and  compare  the  prediction 
methods  described  in  the  preceding  section  it  would  be 
desirable  to  employ  mathematical  techniques.  However  the 
problems  are  so  complex  that  multivariate  statistical 
theory  is  unable  to  solve  most  of  them. 

Another  approach  to  these  problems  has  been  to  apply 
the  different  prediction  methods  to  a  common  body  of  data 
and  to  compare  the  results  (Burket,  1964;  Leiman,  1951). 
There  are  definite  advantages  to  this  approach.  Any 
conclusions  are  based  on  real  data  and  do  not  depend  on 
the  assumptions  in  a  theoretical  development  being  valid. 
However,  there  is  a  major  drawback  to  such  empirical 
techniques.  If  two  or  more  studies,  using  different  data, 
disagree  in  their  conclusions,  it  is  difficult  to  determine 
what  properties  of  the  data  sets  differed  enough  between 
the  studies  to  produce  the  varied  conclusions.  Similarly, 
it  is  difficult  to  evaluate  the  generality  of  conclusions 
found  in  a  single  study  using  one  set  of  data. 

It is therefore desirable to compare the prediction
methods on a wide variety of data sets, differing in a
known way in certain parameters. Since it is hard to
satisfy this condition with real data, it is proposed that some
useful conclusions may be made from the study of artificial
or simulated data sets. Such data sets can be readily
generated on a computer. The parameters specifying the
properties of a data set can be input to the computer and
a wide variety of data sets can be generated by varying
these input parameters. The prediction methods can then
be compared in these data. Such a simulation procedure
is described in this study.

The simulation experiments consist of four stages of
calculations:

Generation of the model. A combined predictor
and criterion population covariance matrix is generated
subject to certain input parameters. The population model
and its parameters are described in Sections 2.2 and 2.3.
In the model the predictors and criteria are expressed
in terms of the principal predictors.

Generation of two samples. Two samples, each of
size N, are obtained from the population generated in the
preceding stage. The samples are obtained by generating
sample covariance matrices; the method is outlined in
Section 2.4. The two samples are used in double
cross-validation.

Calculation of predictor weights. The predictor
weights are calculated, in each sample, by one or more
of the following methods: (a) multiple regression on the
predictors, (b) multiple regression on the principal
components, and (c) multiple regression on the principal
predictors. The calculation of these weights is described
in Sections 2.5, 2.6, and 2.7, respectively.

Cross-validation of the weights. The weights for each
sample (and each method) are cross-validated on the other
sample ($r_c$) and on the population itself ($\rho_c$). The
formulas for the validities are presented in Section 2.8.
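
The four stages can be strung together in a toy version for a single criterion and full multiple regression. Everything specific in this sketch is an assumption: the population covariance matrix is an arbitrary positive definite stand-in rather than the model of Sections 2.2 and 2.3, raw scores are drawn instead of sample covariance matrices, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 6, 60

# Stage 1: population covariance matrix of (x_1, ..., x_n, y); stand-in model.
A = rng.normal(size=(n + 1, n + 1))
Sigma = A @ A.T + np.eye(n + 1)
S_xx, s_xy, s_yy = Sigma[:n, :n], Sigma[:n, n], Sigma[n, n]

# Stage 2: two samples of size N from the population.
def draw_sample():
    data = rng.multivariate_normal(np.zeros(n + 1), Sigma, size=N)
    return data[:, :n], data[:, n]

samples = [draw_sample(), draw_sample()]

# Stage 3: predictor weights by multiple regression on the predictors.
def weights(X, y):
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

# Stage 4: validity of the weights in a sample and in the population.
def sample_validity(b, X, y):
    return np.corrcoef(X @ b, y)[0, 1]

def population_validity(b):
    return (b @ s_xy) / np.sqrt((b @ S_xx @ b) * s_yy)

rho = np.sqrt(s_xy @ np.linalg.solve(S_xx, s_xy) / s_yy)  # population multiple correlation
results = []
for i, (X, y) in enumerate(samples):
    b = weights(X, y)
    X_other, y_other = samples[1 - i]
    results.append({
        "r": sample_validity(b, X, y),                # derivation sample
        "r_c": sample_validity(b, X_other, y_other),  # other sample
        "rho_c": population_validity(b),              # population itself
    })

print(round(rho, 3), [{k: round(v, 3) for k, v in res.items()} for res in results])
```

Because the population is known, the population cross-validity of any set of sample weights can be computed exactly from the covariance matrix, which is not possible with real data.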
CHAPTER 2

THE  MATHEMATICAL  MODEL  AND  SAMPLE  CALCULATIONS 

2.1.  Notation 

Scalars are denoted by lower case letters (m, p).
The only exceptions to this convention are N for sample
size and the elements of matrices. Scalars may be either
numbers or random variables. Column vectors are denoted
by lower case underlined letters (x, a). These vectors
may be either random variable vectors or vectors of numbers.
Row vectors are transposed column vectors, transposition
being indicated by priming (x', a'). Matrices are denoted
by upper case letters (A, $\Sigma$). Transposed matrices are
indicated by a prime (A', $\Sigma$'). The (i, j) element of a
matrix A is denoted by $A_{ij}$. The identity matrix is I.

The matrix consisting of the first t columns of a
matrix A is denoted by $(A)_t$. The first t rows of A are ${}_t(A)$.
The vector consisting of the first t elements of a vector b
is denoted by ${}_t(b)$.

The population covariance matrix of two random vectors x and
y is denoted by $\Sigma_{xy}$. The corresponding sample
covariance matrix is $C_{xy}$. When y is known to have only
one component, the covariance matrices are column vectors
denoted by $\sigma_{xy}$ and $c_{xy}$. The variance of a
scalar y is denoted by $\sigma_{yy}$ or $c_{yy}$. The
abbreviation Var( ) is used to denote the variance of the
random variable enclosed in parentheses.

The Greek letters $\Sigma$, $\sigma$, $\rho$, and $\pi$ are
regularly used to denote population parameters. $\Sigma$ and
$\sigma$ are population covariances as stated above, $\rho$
denotes a population multiple correlation, and $\pi$ is a
population parameter defined in Section 2.2. $\beta$ and
$\Omega$ are also used as population parameters. All other
population parameters are denoted by Latin letters, as are
all sample quantities.

The univariate normal distribution with mean = m and
variance = v is denoted by N(m, v). The multivariate normal
distribution with mean vector = a and covariance matrix =
$\Sigma$ is represented by N(a, $\Sigma$). Fisher's F
distribution with $n_1$ and $n_2$ degrees of freedom is
denoted by F($n_1$, $n_2$). Finally, the chi distribution
with n degrees of freedom is denoted by $\chi$(n). This is
the distribution of the square root of a chi-squared
variable.

A  list  of  symbol  definitions  appears  in  Appendix  F. 


2.2.  The  General  Model  for  Predictors  and  Criteria 

Let x (n components) be n random variables, called
predictors, and let y (m components) be m random variables,
called criteria. Let m be less than n, as is usually the
case in practice. For convenience, normalize all variables
x and y to unit variance. Let x and y have a joint
multivariate normal distribution with null mean vector and
arbitrary covariance matrix

(2.2.1)  $\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$ .

It is shown in Appendix A that x and y can be written in a
special way in terms of (n + m) independent unit variance
random variables w. Let w be partitioned as

(2.2.2)  $w' = (w_1' \; w_2' \; w_3')$

where $w_1$ has m components, $w_2$ has (n - m) components,
and $w_3$ has m components. Then

(2.2.3)  $\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} S_1 & S_2 & 0 \\ F & 0 & E \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix}$ .

The row blocks of the partitioned matrix have n and m rows,
and its column blocks have m, (n - m), and m columns.

Thus the submatrices forming $\Sigma$ may be written as

(2.2.4)  $\Sigma_{xx} = S_1 S_1' + S_2 S_2'$

(2.2.5)  $\Sigma_{xy} = \Sigma_{yx}' = S_1 F'$

(2.2.6)  $\Sigma_{yy} = F F' + E E'$
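As an illustration of (2.2.4) to (2.2.6), the following sketch (assuming NumPy; the function name and the randomly filled matrices are hypothetical and are not normalized to unit variance) builds the three submatrices and forms, for comparison, the product L L' of the partitioned loading matrix of (2.2.3) with its transpose.

```python
import numpy as np

def population_covariance(S1, S2, F, E):
    """Return (Sigma_xx, Sigma_xy, Sigma_yy) from (2.2.4)-(2.2.6)."""
    Sigma_xx = S1 @ S1.T + S2 @ S2.T
    Sigma_xy = S1 @ F.T
    Sigma_yy = F @ F.T + E @ E.T
    return Sigma_xx, Sigma_xy, Sigma_yy

# Arbitrary illustrative matrices: n = 3 predictors, m = 1 criterion.
rng = np.random.default_rng(0)
n, m = 3, 1
S1 = rng.standard_normal((n, m))
S2 = rng.standard_normal((n, n - m))
F = rng.standard_normal((m, m))
E = rng.standard_normal((m, m))
Sigma_xx, Sigma_xy, Sigma_yy = population_covariance(S1, S2, F, E)

# Loading matrix of (2.2.3); since w has identity covariance,
# the full covariance matrix is Sigma = L L'.
L = np.block([[S1, S2, np.zeros((n, m))],
              [F, np.zeros((m, n - m)), E]])
Sigma = L @ L.T
```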

This representation of x and y can be understood in the
following way. Let each of the y variables be predicted from
a linear combination of the x variables. The best least
squares prediction (multiple regression) of y is

(2.2.7)  $\hat{y} = B' x = \Sigma_{yx} \Sigma_{xx}^{-1} x$ .

The weight matrix is written as B' rather than B so that
when m = 1, B' = b', the transpose of a column vector.
Each component of the predictable part $\hat{y}$ is the best
predictor of the corresponding component of y. The squared
multiple correlation of the criterion $y_j$ with the n
predictors is then (recall that $y_j$ has unit variance)

(2.2.8)  $\rho_j^2 = \mathrm{Var}(\hat{y}_j) = (B' \Sigma_{xx} B)_{jj}$ .




Now the variables $\hat{y}$ may be transformed to independent
unit variance variables $w_1$ by

(2.2.9)  $\hat{y} = F w_1$

where F is orthogonal by columns, so that

(2.2.10)  $F' F = D^2$ (diagonal)

with the diagonal elements of $D^2$ in descending order. The
m variables $w_1$ are the principal predictors (see Appendix
A, equations (A.6) to (A.10)). The principal predictors are
numbered in order so that the first accounts for the largest
proportion of the variance of the $\hat{y}$ variables, the
second accounts for the next largest proportion, and so on.

On referring back to the model (2.2.3) one sees that the
criteria y are written as the sum of a linear transformation
of the principal predictors $w_1$ and a transformation of m
other independent variables $w_3$. Similarly, the predictors
x are written as the sum of a linear transformation of the
same principal predictors $w_1$ and a transformation of
(n - m) other independent variables $w_2$. The association
between y and x is expressed through their dependence on the
principal predictors $w_1$. The non-associated parts of x and
y are expressed in terms of independent variables $w_2$ (for
x) and $w_3$ (for y). The total set of variables
$w' = (w_1' \; w_2' \; w_3')$ are independent unit variance
normal.

The matrices $S_1$ and F are central to the description of
the dependence of the criteria on the predictors.

Let us first consider F. The dependence of the m criteria y
on the kth principal predictor is given by the sum of the
squares of the elements in the kth column of F. Let this
quantity be $D_{kk}^2$ (the kth element of the diagonal
matrix $D^2$):

(2.2.11)  $D_{kk}^2 = \sum_{j=1}^{m} F_{jk}^2$ .

$D_{kk}^2$ is the kth eigenvalue of
$\Sigma_{\hat{y}\hat{y}} = F F'$ (see Appendix A, equations
(A.6) and (A.7)). The quantities $D_{kk}^2$ are thus
monotonically decreasing numbers. If, in a certain
population, the first eigenvalue is very large and the others
small, this indicates that most of the prediction of y from
x is derived from only one linear combination of the x
variables, namely the first principal predictor. On the
other hand, if several of the $D_{kk}^2$ are large, then
several independent linear combinations of the predictors are
needed in order to get maximum prediction of y from x.

The average, over criteria, of the squared multiple
correlations is simply related to the $D_{kk}^2$ (see
equation (A.12)):

(2.2.12)  $\rho^2 = (1/m) \sum_{j=1}^{m} \rho_j^2 = (1/m) \sum_{k=1}^{m} D_{kk}^2$ .

The relation of the predictors x to the principal predictors
through the matrix $S_1$ is quite independent of the matrix F
and the quantities $D_{kk}^2$. For convenience, let us
suppose that the x variables are normalized to unit variance.
Let us define the (n x n) matrix S as the super-matrix

(2.2.13)  $S = (S_1 \; S_2)$ .

Thus

(2.2.14)  $x = S \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$ .


Since the w variables are independent and of unit variance,
the sum of squares of each row of S is 1.0. The sum of
squares of the ith row of $S_1$ is then less than 1.0 and
represents the magnitude of the dependence of the ith
predictor on the principal predictors. Let $q_k^2$
(k = 1, ..., m) denote the sum of squares of the kth column
of S:

(2.2.15)  $q_k^2 = \sum_{i=1}^{n} S_{ik}^2$ .

$q_k^2$ is a measure of the average dependence or relation of
the x variables to the kth principal predictor. The $q_k^2$
are analogous to the $D_{kk}^2$ since they represent,
respectively, the average dependence of the predictors and
the criteria on the kth principal predictor.

We may average the $q_k^2$ in the same way as the $D_{kk}^2$
are averaged in (2.2.12):

(2.2.16)  $\pi^2 = (1/n) \sum_{k=1}^{m} q_k^2 = (1/n) \sum_{k=1}^{m} \sum_{i=1}^{n} S_{ik}^2$ .


Note that the division is by n, not m, in order to get a
parameter $\pi^2$ with maximum value 1.0. $\pi^2$ represents
the average, over predictors, of the predictor variance
related to the principal predictors. It is a measure of the
dependence of the predictors on the principal predictors.
For brevity, $\pi^2$ will be called the average
criterion-related predictor variance. This description does
not imply that $\pi^2$ is an average squared multiple
correlation of x predicted from y (roles of predictors and
criteria reversed). $\pi^2$ is, however, the average squared
multiple correlation of the x variables predicted from the
principal predictors $w_1$.

Another way to interpret $\pi^2$ is to define $\hat{x}$ as
the parts of x linearly predictable from $w_1$:

(2.2.17)  $\hat{x} = S_1 w_1$ .

Then the variance of $\hat{x}_i$, the predictable part of the
ith predictor, is

(2.2.18)  $\mathrm{Var}(\hat{x}_i) = \sum_{k=1}^{m} S_{ik}^2$ .

The average of these variances is

(2.2.19)  $\pi^2 = (1/n) \sum_{i=1}^{n} \mathrm{Var}(\hat{x}_i) = (1/n) \sum_{i=1}^{n} \sum_{k=1}^{m} S_{ik}^2 = (1/n) \sum_{k=1}^{m} q_k^2$ .
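A small numerical sketch may make the two ways of averaging concrete. It assumes NumPy; the function name and the entries of the matrix are hypothetical. Column sums of squares give the $q_k^2$ of (2.2.15), row sums of squares give the variances of (2.2.18), and both lead to the same $\pi^2$.

```python
import numpy as np

def criterion_related_variance(S1):
    """Return (q2, v, pi2) for an (n x m) matrix S1."""
    sq = np.asarray(S1, dtype=float) ** 2
    q2 = sq.sum(axis=0)              # q_k^2: column sums of squares (2.2.15)
    v = sq.sum(axis=1)               # Var(x_hat_i): row sums of squares (2.2.18)
    pi2 = q2.sum() / sq.shape[0]     # (2.2.16), division by n
    return q2, v, pi2

# Hypothetical S1 with n = 4 predictors and m = 2 principal predictors.
S1 = np.array([[0.6, 0.0],
               [0.3, 0.4],
               [0.0, 0.5],
               [0.1, 0.2]])
q2, v, pi2 = criterion_related_variance(S1)
```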

Consider now two populations, each with the same average
squared multiple correlation $\rho^2$. One population might
have a small value of $\pi^2$ and the second a large value of
$\pi^2$. In the first population the predictors depend very
little on the principal predictors, while in the second the
dependence is greater. Nevertheless the prediction of the
criteria is the same in each population. This paradoxical
situation can be understood by first noting that we are
considering for the moment prediction in the population, not
in finite samples. The prediction of y is solely via the
principal predictors $w_1$, and the random vector $w_1$ is an
exact linear combination of the predictors x since S is
square and non-singular:

(2.2.20)  $\begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = S^{-1} x$ .

As additional verification that the average multiple
correlation is independent of $\pi^2$, note that $\rho^2$
depends only on F in (2.2.11) and (2.2.12).

The parameter $\pi^2$ will have an effect on prediction in
finite samples, however. Consider prediction from the
principal components of the predictors. When $\pi^2$ is
large, the largest principal component will be largely in the
space of the principal predictors and will contribute to
prediction. However, when $\pi^2$ is small, the first
principal component will be unrelated to the principal
predictor space and will be a very poor predictor. The
effectiveness of prediction from the principal components
will thus depend on the size of $\pi^2$. Empirical
confirmation of this phenomenon will be demonstrated in
Section 3.4.


2.3.  Computer Generation of the Model

Since any set of n predictors and m criteria may be written
in the form (2.2.3), it is possible to generate an arbitrary
population distribution by specifying the matrices S, F, and
E. The covariance matrices $\Sigma_{xx}$, $\Sigma_{xy}$, and
$\Sigma_{yy}$ may then be calculated by (2.2.4) to (2.2.6),
where S is related to $S_1$ and $S_2$ by (2.2.13).

Rather than allowing the matrices S, F, and E to be
completely arbitrary, a few basic parameters may be fixed
arbitrarily and the matrices then generated essentially
randomly subject to these given parameters. These arbitrary
parameters are called "input parameters" since they are
input to the computer program that generates the model.
The major input parameters are:

1. n = number of predictors.

2. m = number of criteria.

3. ($D_{kk}^2$, k = 1, ..., m), the eigenvalues of
$\Sigma_{\hat{y}\hat{y}} = F F'$.

4. ($q_k^2$, k = 1, ..., m), the dependencies of the
predictors on the principal predictors.

Note that once $D_{kk}^2$ and $q_k^2$ are specified for all
k, the average squared multiple correlation $\rho^2$ and the
average criterion-related predictor variance $\pi^2$ are
fixed by (2.2.12) and (2.2.16). In particular, when m = 1 as
in Chapter 3, $\rho^2 = D_{11}^2$ and $\pi^2 = (1/n) q_1^2$.

Two additional parameters, related to n and m, are:

1a. $n_s$ = number of columns of S.

2a. $m_d$ = number of duplicate criteria.

In the model described in Section 2.2, $S = (S_1 \; S_2)$ is
an (n x n) square matrix. $S_1$ has m columns and $S_2$ has
(n - m) columns. In some of the experiments described in
Sections 3.1 and 3.2, S has more than n columns, namely
$n_s$ columns, so that $S_2$ has ($n_s$ - m) columns and
$S_1$ still has m columns.

All the experiments in Chapter 3 involve only one criterion
(m = 1). However, several duplicate criteria are allowed and
the number of such duplicate criteria is denoted by $m_d$.
Duplicate criteria are described further in Chapter 3 and
Appendix D.




Four minor parameters are needed to complete the input for
the model:

5. $v_x$ = variance of the generated $\mathrm{Var}(\hat{x}_i)$.

6. $e_x$ = tolerance on this variance.

7. $v_y$ = variance of the generated squared multiple
correlations $\rho_j^2$.

8. $e_y$ = tolerance on this variance.

The computer program generates matrices F and E so that the
m squared multiple correlations $\rho_j^2$ calculated from
them have mean exactly equal to the average of the
$D_{kk}^2$ and variance equal to $v_y$ within a maximum error
of $e_y$. That is,

(2.3.1)  $\rho^2 = (1/m) \sum_{j=1}^{m} \rho_j^2 = (1/m) \sum_{k=1}^{m} D_{kk}^2$

and

(2.3.2)  $|\mathrm{Var}(\rho_j^2) - v_y| = |(1/m) \sum_{j=1}^{m} (\rho_j^2 - \rho^2)^2 - v_y| < e_y$ .

Matrix S, composed of $S_1$ and $S_2$, is generated so that
the mean of the variances of $\hat{x}_i$ calculated from
$S_1$ is exactly equal to $(1/n) \sum_{k=1}^{m} q_k^2$ and
the variance of these variances is equal to $v_x$ within a
maximum error of $e_x$. That is,

(2.3.3)  $\pi^2 = (1/n) \sum_{i=1}^{n} \mathrm{Var}(\hat{x}_i) = (1/n) \sum_{k=1}^{m} q_k^2$

and

(2.3.4)  $|\mathrm{Var}[\mathrm{Var}(\hat{x}_i)] - v_x| < e_x$ .

The two occurrences of "Var" in the preceding formula refer
to different types of variances. $\mathrm{Var}(\hat{x}_i)$
means the variance of the random variable $\hat{x}_i$ in the
population. Let the constants $\mathrm{Var}(\hat{x}_i) = v_i$
temporarily. Then $\mathrm{Var}(v_i)$ is simply shorthand
notation for

(2.3.5)  $\mathrm{Var}(v_i) = (1/n) \sum_{i=1}^{n} (v_i - \pi^2)^2$ .

Note that $\pi^2$ is the mean of the $v_i$.

Generation  of  F 

The matrix F is the product of an orthonormal matrix V and a
diagonal matrix D consisting of the square roots of the
eigenvalues $D_{kk}^2$ (equation (A.8)):

(2.3.6)  $F = V D$ .

The matrix $D^2$ is input, so that generation of F reduces to
the generation of an orthonormal V satisfying the two
restrictions (2.3.1) and (2.3.2) on the squared multiple
correlations $\rho_j^2$. $\rho_j^2$ may be expressed in terms
of V and D by

(2.3.7)  $\rho_j^2 = \mathrm{Var}(\hat{y}_j) = (F F')_{jj} = (V D^2 V')_{jj} = \sum_{k=1}^{m} V_{jk}^2 D_{kk}^2$ .

When $\rho_j^2$ is calculated in this way with V orthonormal,
equation (2.3.1) is automatically satisfied. The restrictions
on V are then

(a) V must be orthonormal,

(b) all $\rho_j^2$ calculated from (2.3.7) must be less than
1.0, and

(c) the variance of the $\rho_j^2$ must satisfy equation
(2.3.2).

An algorithmic procedure to generate a V satisfying these
three conditions, for arbitrary parameters m,
($D_{kk}^2$, k = 1, ..., m), $v_y$, and $e_y$, is outlined
in Appendix C.
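The three restrictions on V can be illustrated with a sketch. The procedure of Appendix C is not reproduced here; as a stand-in, the code below (assuming NumPy, with a hypothetical function name and hypothetical input values) draws random orthonormal matrices from a QR decomposition and keeps the first one that passes conditions (b) and (c).

```python
import numpy as np

def generate_F(d2, v_y, e_y, rng, max_tries=10000):
    """Return (F, rho2) with F = V D, V orthonormal, all rho_j^2 < 1
    and |Var(rho_j^2) - v_y| < e_y. Rejection sampling is an assumed
    substitute for the algorithm of Appendix C."""
    d2 = np.asarray(d2, dtype=float)         # eigenvalues D_kk^2, descending
    m = d2.size
    for _ in range(max_tries):
        # Random orthonormal V from the QR decomposition of a normal matrix.
        V, _ = np.linalg.qr(rng.standard_normal((m, m)))
        rho2 = (V ** 2) @ d2                 # (2.3.7)
        if np.all(rho2 < 1.0) and abs(rho2.var() - v_y) < e_y:
            return V * np.sqrt(d2), rho2     # column k of V scaled by D_kk
    raise RuntimeError("no orthonormal V met the restrictions")

rng = np.random.default_rng(1)
d2 = np.array([1.2, 0.6, 0.3])               # hypothetical input eigenvalues
F, rho2 = generate_F(d2, v_y=0.02, e_y=0.01, rng=rng)
```

Because the squared elements in each column of an orthonormal V sum to one, the mean of the generated $\rho_j^2$ equals the mean of the $D_{kk}^2$ for every candidate V, which is why only conditions (b) and (c) need to be tested.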

Generation  of  E 

The elements of the (m x m) matrix E are first generated
randomly from N(0, 1) and then the rows of E are normalized
so that, for all j,

(2.3.8)  $(E E')_{jj} = \sum_{k=1}^{m} E_{jk}^2 = 1 - \rho_j^2$

where the $\rho_j^2$ are calculated from (2.3.7). This
normalization ensures that the criteria y are normalized to
unit variance. The methods used to generate normal random
numbers, as well as other random numbers discussed in this
chapter, are given in Appendix B.
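The row normalization of (2.3.8) takes only a few lines. The sketch below assumes NumPy and uses its random generator in place of the methods of Appendix B; the function name and the values of $\rho_j^2$ are hypothetical.

```python
import numpy as np

def generate_E(rho2, rng):
    """Random (m x m) matrix whose jth row has sum of squares
    1 - rho2[j], as required by (2.3.8)."""
    rho2 = np.asarray(rho2, dtype=float)
    m = rho2.size
    E = rng.standard_normal((m, m))          # N(0, 1) elements
    row_ss = (E ** 2).sum(axis=1)
    return E * np.sqrt((1.0 - rho2) / row_ss)[:, None]

rng = np.random.default_rng(2)
rho2 = np.array([0.85, 0.70, 0.55])          # hypothetical rho_j^2
E = generate_E(rho2, rng)
```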

Generation  of  S 

Two different methods for generating S were developed for
the experiments described in Chapters 3 and 4. The first is
called the $e_x = 0$ method since the variance of
$\mathrm{Var}(\hat{x}_i)$ is exactly equal to $v_x$. This
method was used for the single criterion case (m = 1) in
Chapter 3. The method does not generalize well to the m > 1
case and therefore a second method, called the $e_x \neq 0$
method, was used for the several criterion calculations in
Chapter 4. This latter method could have been employed in
Chapter 3 except that $e_x$ cannot be set to zero in this
method.

When m = 1, $\pi^2$ can be calculated directly from the input
parameters as

(2.3.9)  $\pi^2 = (1/n) q_1^2$ .

Then, in the $e_x = 0$ method, n numbers are generated
randomly from N($\pi^2$, $v_x$) subject to the restriction
that no number be more than 1.0 or less than 0.0. The numbers
are rescaled after generation so that their mean is exactly
$\pi^2$ and their variance is $v_x$. If one of the numbers is
now more than 1.0 or less than 0.0, the number is discarded
and a new attempt is made to satisfy the conditions. These n
numbers are the variances of $\hat{x}_i$, denoted by
$v_i = \mathrm{Var}(\hat{x}_i)$ in equation (2.3.5). When
this step is completed, equations (2.3.3) and (2.3.4) are
exactly satisfied with $e_x = 0$.

Now $S_1$ is a single column, $s_1$, and its elements are
defined as the square roots of ($v_i$, i = 1, ..., n), with
their signs chosen randomly. The elements of the last (n - 1)
columns of S, namely $S_2$, are generated randomly from
N(0, 1) and then rescaled, by rows, so that the row sum of
squares of the whole S matrix is unity. This rescaling
ensures that all x variables have unit variance. The
generation of S by the $e_x = 0$ method is now complete.
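The $e_x = 0$ method can be sketched as follows, assuming NumPy. The function name and argument names are hypothetical, and one simplification is made: when any number falls outside [0, 1], the whole set of n numbers is redrawn rather than a single number.

```python
import numpy as np

def generate_S_exact(n, n_cols, q1_sq, v_x, rng, max_tries=10000):
    """e_x = 0 method for m = 1. Returns S (n x n_cols) with unit row
    sums of squares; the squared elements of its first column have
    mean q1_sq / n and variance exactly v_x."""
    pi2 = q1_sq / n                                   # (2.3.9)
    for _ in range(max_tries):
        v = rng.normal(pi2, np.sqrt(v_x), size=n)
        if np.any(v < 0.0) or np.any(v > 1.0):
            continue                                  # assumed: redraw the whole set
        # Rescale to the exact mean and variance.
        v = pi2 + (v - v.mean()) * np.sqrt(v_x / v.var())
        if np.all(v >= 0.0) and np.all(v <= 1.0):
            break
    else:
        raise RuntimeError("could not satisfy the restrictions")
    s1 = np.sqrt(v) * rng.choice([-1.0, 1.0], size=n)  # random signs
    S2 = rng.standard_normal((n, n_cols - 1))
    # Rescale the rows of S2 so each row of S has sum of squares 1.
    S2 *= np.sqrt((1.0 - v) / (S2 ** 2).sum(axis=1))[:, None]
    return np.column_stack([s1, S2])

rng = np.random.default_rng(4)
S = generate_S_exact(n=6, n_cols=6, q1_sq=1.8, v_x=0.01, rng=rng)
```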
The $e_x \neq 0$ method of generating S is very similar to
the method of generating F. In analogy to (2.3.6), in which
all matrices are (m x m), $S_1$ is written as

(2.3.10)  $S_1 = T Q$

where $S_1$ and T are (n x m) and Q is diagonal (m x m) with
diagonal elements ($q_k$, k = 1, ..., m). T (like V) is
orthonormal by columns. The $\mathrm{Var}(\hat{x}_i)$ may be
written in terms of T and Q as

(2.3.11)  $\mathrm{Var}(\hat{x}_i) = (S_1 S_1')_{ii} = (T Q^2 T')_{ii} = \sum_{k=1}^{m} T_{ik}^2 q_k^2$ .

Since T is orthonormal it follows that the average of these
variances is exactly $(1/n) \sum_{k=1}^{m} q_k^2 = \pi^2$, so
that (2.3.3) is exactly satisfied. The remaining restrictions
on T (as on V) are then

(a) T must be orthonormal by columns,

(b) all $\mathrm{Var}(\hat{x}_i)$ calculated from (2.3.11)
must be less than 1.0, and

(c) the variance of the $\mathrm{Var}(\hat{x}_i)$ must
satisfy (2.3.4).

The algorithmic procedure for generating V (Appendix C) may
also be used to generate a T satisfying the above three
conditions for arbitrary n, m, ($q_k^2$, k = 1, ..., m),
$v_x$, and $e_x$.

After $S_1$ is generated, $S_2$ (n x ($n_s$ - m)) is
generated in the same way as in the $e_x = 0$ method. First
the elements of $S_2$ are generated as N(0, 1) random
numbers. The rows of $S_2$ are then rescaled so that each row
sum of squares of $S = (S_1 \; S_2)$ is unity, resulting in
unit variance x variables. This completes the $e_x \neq 0$
method for generating S.
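A sketch of the $e_x \neq 0$ method follows, assuming NumPy. As with V, the Appendix C procedure is replaced here by an assumed rejection step over random matrices with orthonormal columns; the function name and the input values are hypothetical.

```python
import numpy as np

def generate_S_tolerance(n, n_cols, q2, v_x, e_x, rng, max_tries=10000):
    """e_x != 0 method: S1 = T Q with T orthonormal by columns, then
    S2 filled with rescaled normal numbers. Returns S = (S1 S2)."""
    q2 = np.asarray(q2, dtype=float)          # q_k^2, k = 1..m
    m = q2.size
    for _ in range(max_tries):
        # Reduced QR gives an (n x m) matrix with orthonormal columns.
        T, _ = np.linalg.qr(rng.standard_normal((n, m)))
        v = (T ** 2) @ q2                     # Var(x_hat_i), (2.3.11)
        if np.all(v < 1.0) and abs(v.var() - v_x) < e_x:
            break
    else:
        raise RuntimeError("no T met the restrictions")
    S1 = T * np.sqrt(q2)                      # (2.3.10)
    S2 = rng.standard_normal((n, n_cols - m))
    S2 *= np.sqrt((1.0 - v) / (S2 ** 2).sum(axis=1))[:, None]
    return np.hstack([S1, S2])

rng = np.random.default_rng(5)
q2 = np.array([1.5, 0.9])                     # hypothetical q_k^2, pi^2 = 0.4
S = generate_S_tolerance(n=6, n_cols=6, q2=q2, v_x=0.05, e_x=0.03, rng=rng)
```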

2.4.  Computer Generation of Data Samples

x and y are (n + m) random variables with a joint
multivariate normal distribution. The mean vector is the
null vector and the covariance matrix is $\Sigma$ as given in
equation (2.2.1). In order to draw samples from this
distribution, it would be a straightforward procedure to use
equation (2.2.3), which expresses x and y in terms of
independent N(0, 1) variables w. For each simulated subject
it would be necessary to generate (n + m) independent N(0, 1)
numbers and to place these in equation (2.2.3) as the w
values. The sample x and y vectors would then be found by
matrix multiplication.

This procedure, while conceptually simple, has the
disadvantage that the computer time required increases
linearly with N, the sample size. The method is thus
impractical for all but small sample sizes.

Another procedure was chosen instead. It is not based on
generating sample vectors x and y at all but on generating a
sample covariance matrix

(2.4.1)  $C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}$ .

The method is the Bartlett decomposition of the Wishart
distribution (Bartlett, 1933; Kshirsagar, 1959; Wijsman,
1957). The covariance matrix C has a Wishart distribution
depending solely on the population covariance matrix
$\Sigma$, the sample size N, and the number of variables,
which is (n + m).

Let the population covariance matrix $\Sigma$ be written as

(2.4.2)  $\Sigma = \Omega \Omega'$ .

This may be done in a variety of ways. The Gauss-Doolittle
method for computing a triangular $\Omega$ was used.

Let an ((n + m) x (n + m)) matrix A be defined as

(2.4.3)  $A = (1/N) \, T T'$

where T is a lower triangular ((n + m) x (n + m)) matrix
whose lower triangular elements are independent random
variables:

(2.4.4)  $T_{ij}$ (i > j) are N(0, 1),
         $T_{ii}$ are $\chi$(N - i),
         $T_{ij} = 0$ (i < j).

Then, if we compute

(2.4.5)  $C = \Omega A \Omega' = (1/N) \, \Omega T T' \Omega'$ ,

C will have a Wishart distribution as desired. Equation
(2.4.5) is the Bartlett decomposition of the Wishart matrix
C. A is a sample covariance matrix from a population with
identity covariance matrix. The letter A is used as a
temporary symbol in this paragraph and is reserved for
another use in Section 2.6. The generation of normal and chi
variables is described in Appendix B.
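Equations (2.4.2) to (2.4.5) translate almost directly into code. The sketch below assumes NumPy, uses its Cholesky routine as a stand-in for the Gauss-Doolittle method (both yield a triangular factor), and uses NumPy's generators in place of the methods of Appendix B. The function name and the example matrix are hypothetical.

```python
import numpy as np

def sample_covariance_bartlett(Sigma, N, rng):
    """Draw one sample covariance matrix C for sample size N from a
    population with covariance Sigma, following (2.4.2)-(2.4.5).
    Requires N larger than the number of variables."""
    p = Sigma.shape[0]
    Omega = np.linalg.cholesky(Sigma)        # lower triangular, Sigma = Omega Omega'
    # N(0, 1) below the diagonal, zeros on and above it.
    T = np.tril(rng.standard_normal((p, p)), k=-1)
    i = np.arange(1, p + 1)
    # chi(N - i) on the diagonal: square root of a chi-squared variable.
    T[np.diag_indices(p)] = np.sqrt(rng.chisquare(N - i))
    M = Omega @ T
    return (M @ M.T) / N                     # (2.4.5)

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
C = sample_covariance_bartlett(Sigma, 50, rng)
```

The cost of one draw depends on the number of variables but not on N, which is the advantage over generating N individual score vectors.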

2.5.  Multiple Regression on the Predictors

The most widely used method for prediction is multiple
regression. This least squares method ensures that, in the
derivation sample, the correlation between the predicted
score and the observed criterion score is a maximum. The
maximum correlation is the multiple correlation.

Let the covariance matrix of the n predictors in the
derivation sample (of size N) be $C_{xx}$ (n x n) and let the
column vector $c_{xy}$ (n x 1) be the covariance between each
predictor and the single criterion (m = 1). Let the
coefficients of the multiple regression combination of the
predictors be $b_1$ (n x 1), the subscript indicating that
the first method of prediction, multiple regression on the
predictors, is being used. The linear combination of the
predictors is then $b_1' x$.

The solution for $b_1$ is well known to be

(2.5.1)  $b_1 = C_{xx}^{-1} c_{xy}$ .

The correlation between $\hat{y} = b_1' x$ and y is the
multiple correlation. The square of this correlation is

(2.5.2)  $r_1^2 = \frac{b_1' c_{xy}}{c_{yy}}$

where $c_{yy}$ is the sample variance of the criterion. The
proof of this formula is presented in Appendix E. The
subscript on $r_1^2$ is only used in this chapter to
distinguish the three methods of prediction. The subscript
is dropped in later chapters.
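Equations (2.5.1) and (2.5.2) can be checked on a small example. The sketch assumes NumPy; the function name and the covariance values are hypothetical.

```python
import numpy as np

def regression_weights(C_xx, c_xy, c_yy):
    """Return (b_1, r_1^2) from equations (2.5.1) and (2.5.2)."""
    b1 = np.linalg.solve(C_xx, c_xy)     # solves C_xx b = c_xy without forming the inverse
    r1_sq = b1 @ c_xy / c_yy
    return b1, r1_sq

# Two correlated predictors, criterion with unit variance.
C_xx = np.array([[1.0, 0.5], [0.5, 1.0]])
c_xy = np.array([0.6, 0.3])
b1, r1_sq = regression_weights(C_xx, c_xy, 1.0)
```

In this example the second predictor receives a zero weight because its covariance with the criterion is fully accounted for by its correlation with the first predictor.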


2.6.  Multiple  Regression  on  the  Principal  Components 
As an alternative to the original predictors one can use the
largest principal components of the predictors in the
regression function. The scores on the principal components
are first estimated from the predictor scores.

The following calculations are made in the derivation
sample. The characteristic roots and vectors of the
covariance matrix of the predictors, $C_{xx}$ (n x n), are
calculated. Let the roots be written in descending order on
the diagonal of a diagonal matrix $U^2$ and the vectors in
corresponding order as the columns of an orthonormal matrix
W. Then $C_{xx}$ may be written as

(2.6.1)  $C_{xx} = W U^2 W'$

where all matrices are (n x n). The characteristic vectors
are the coefficients for relating the principal components,
f, to the predictors, i.e.

(2.6.2)  $f = W' x$

where x is the (n x 1) column vector of one subject's scores
on the predictors and f is the (n x 1) column vector of the
principal component scores (Anderson, 1958, pp. 273-277).

We wish to use the t largest principal components in the
regression function. From (2.6.2), the scores on these t
principal components are estimated by

(2.6.3)  ${}_t(f) = {}_t(W') x$

where ${}_t(f)$ is (t x 1) and is the vector of the first t
elements of f. ${}_t(W')$ is the matrix consisting of the
first t rows of W'.

The prediction equation, using the t largest principal
components, is

(2.6.4)  $\hat{y}^{(t)} = d' \, {}_t(f)$

where d is a temporary symbol representing the t-vector of
weights for the principal components. The multiple
regression solution for d is

(2.6.5)  $d = C_{ff}^{-1} c_{fy}$

in analogy to the solution (2.5.1). In (2.6.5), $C_{ff}$ is
(t x t) and $c_{fy}$ is (t x 1). The covariance matrix of
the t principal components is, from (2.6.3), (2.6.1), and
the orthonormality of W,

(2.6.6)  $C_{ff} = {}_t(W') \, C_{xx} \, (W)_t = {}_t(U^2)_t$

and the covariance of the t principal components and the
criterion is

(2.6.7)  $c_{fy} = {}_t(W') \, c_{xy}$ .

Therefore the weight vector is

(2.6.8)  $d = {}_t(U^{-2})_t \; {}_t(W') \, c_{xy}$ .

Substituting (2.6.3) and (2.6.8) into (2.6.4), we find that

(2.6.9)  $\hat{y}^{(t)} = c_{xy}' \, (W)_t \; {}_t(U^{-2})_t \; {}_t(W') \, x$

is the equation for predicting y from x using regression on
the t largest principal components of the predictors. This
equation may be simplified slightly by writing

(2.6.10)  $A = W U$

so that (2.6.1) becomes

(2.6.11)  $C_{xx} = A A'$ .

Then (2.6.9) becomes

(2.6.12)  $\hat{y}^{(t)} = c_{xy}' \, ((A')^{-1})_t \; {}_t(A^{-1}) \, x$ .

Equation (2.6.12) is the formula for predicting y from x
using regression on the t largest principal components. If
we express the right hand side of this equation as
$[b_2^{(t)}]' x$, then the weight vector is

(2.6.13)  $b_2^{(t)} = ((A')^{-1})_t \; {}_t(A^{-1}) \, c_{xy}$ .

It is shown in Appendix E that the squared multiple
correlation is

(2.6.14)  $[r_2^{(t)}]^2 = \frac{[b_2^{(t)}]' c_{xy}}{c_{yy}}$

which is identical in form to (2.5.2) but of course involves
$b_2^{(t)}$ instead of $b_1$. It is important to note that
$r_2^{(t)}$ is not the multiple correlation of y with the
predictors x. The latter multiple correlation is $r_1$. The
symbol $r_2^{(t)}$ represents the multiple correlation of y
and the t largest principal components of the predictors,
and therefore $r_2^{(t)}$ must be less than or equal to
$r_1$, with equality when t = n.

The subscript 2 is dropped in later chapters when the context
makes clear which method of prediction is used.

Multiple regression on the principal components may be
called a reduced rank method since the use of the t largest
principal components instead of the original predictors is
equivalent to approximating the matrix $C_{xx}$ by the matrix

(2.6.15)  $\hat{C}_{xx} = (A)_t \; {}_t(A')$ .

The matrix $\hat{C}_{xx}$ is of reduced rank t < n.
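The weight vector of this section can be computed from an eigendecomposition of $C_{xx}$. The sketch below assumes NumPy; the function name and the covariance values are hypothetical. It works with W and $U^2$ directly, which gives the same weights as (2.6.13) because $A^{-1} = U^{-1} W'$.

```python
import numpy as np

def pc_regression_weights(C_xx, c_xy, c_yy, t):
    """Return (b_2^(t), [r_2^(t)]^2) for regression on the t largest
    principal components of the predictors."""
    roots, W = np.linalg.eigh(C_xx)          # eigh returns ascending roots
    order = np.argsort(roots)[::-1][:t]      # indices of the t largest roots
    W_t = W[:, order]                        # (W)_t, n x t
    d = (W_t.T @ c_xy) / roots[order]        # component weights, (2.6.8)
    b2 = W_t @ d                             # weights on the original predictors
    return b2, b2 @ c_xy / c_yy              # (2.6.14)

C_xx = np.array([[1.0, 0.5], [0.5, 1.0]])
c_xy = np.array([0.6, 0.3])
b2_one, r2_one = pc_regression_weights(C_xx, c_xy, 1.0, t=1)
b2_all, r2_all = pc_regression_weights(C_xx, c_xy, 1.0, t=2)
```

With t = n the weights coincide with ordinary multiple regression on the predictors; with t = 1 in this example both predictors receive the same weight and the squared correlation is smaller.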


2.7.  Multiple  Regression  on  the  Principal  Predictors 

Another  method  of  calculating  independent  scores  in 
the  derivation  sample  is  the  method  of  principal  predictors. 
Scores  on  the  largest  principal  predictors  are  used  in 
the  regression  equation.  The  method  is  only  applicable 
if  there  are  several  criteria  (m  >  1). 

Let the scores of a subject on the m criteria be
$y_1$, ..., $y_m$, which may be placed in a column vector y
(m x 1). Each criterion, $y_j$, has a part, $\hat{y}_j$,
linearly predictable from the n predictors, x, in the
derivation sample:

(2.7.1)  $\hat{y}_j = c_{xy_j}' \, C_{xx}^{-1} \, x$

where $c_{xy_j}$ (n x 1) is the covariance of $y_j$ with the
n predictors x and $C_{xx}$ is the covariance of the
predictors.

The m predictable parts of the criteria may be written as a
column vector $\hat{y}$ (m x 1). Then equation (2.7.1) may be
rewritten as

(2.7.2)  $\hat{y} = C_{xy}' \, C_{xx}^{-1} \, x = C_{yx} \, C_{xx}^{-1} \, x$

where $C_{yx}$ (m x n) is the covariance matrix of the m
criteria with the n predictors. The covariance of the
predictable parts is the (m x m) matrix
$C_{\hat{y}\hat{y}}$:

(2.7.3)  $C_{\hat{y}\hat{y}} = C_{yx} \, C_{xx}^{-1} \, C_{yx}'$ .


Let us diagonalize $C_{\hat{y}\hat{y}}$ in analogy to the way
that $C_{xx}$ was written in (2.6.1) and (2.6.11):

(2.7.4)  $C_{\hat{y}\hat{y}} = V D^2 V' = G G'$ .

The matrices V and D are sample estimates of the
corresponding population matrices denoted by the same
symbols in Sections 2.2 and 2.3. The eigenvalues are written
in decreasing order in the diagonal of $D^2$ and the
eigenvectors are written in the corresponding order as
columns of V. The rows of G are the coefficients for relating
the predictable parts of the criteria and the principal
predictors, i.e.

(2.7.5)  $\hat{y} = G w$ .

Thus the equation for estimating the scores on the m
principal predictors (column vector w (m x 1)) from the
predictable parts is

(2.7.6)  $w = G^{-1} \hat{y}$

which, combined with equation (2.7.2), gives

(2.7.7)  $w = G^{-1} \, C_{yx} \, C_{xx}^{-1} \, x$

as the equation for estimating w from x.


The equation for estimating the t largest principal
predictors ${}_t(w)$ is

(2.7.8)  ${}_t(w) = {}_t(G^{-1}) \, C_{yx} \, C_{xx}^{-1} \, x$

where ${}_t(G^{-1})$ is the first t rows of $G^{-1}$.

Note that w is a vector of numbers which can be calculated for each subject. The use here of w for the estimated principal predictor score vector should not be confused with the use of w, in Section 2.2, for the principal predictors, a random vector in the population.

Since the principal predictors are uncorrelated in the derivation sample, the multiple regression weights for predicting each criterion from the principal predictors are simply the rows of the pattern matrix G as shown in (2.7.5). Each row of G is the weight vector for one criterion variable. The coefficients G are still the correct weights when only some of the principal predictors are included in the regression. If the t largest principal predictors are included, the predicted parts of the criteria are

(2.7.9)  $\hat{y}_3 = (G)_t \, {}_t(w)$ ,

where $(G)_t$ is the first t columns of G. In order to express this equation in terms of the original predictor scores as

(2.7.10)  $\hat{y}_3 = B_3' x$ ,

equations (2.7.9) and (2.7.8) may be combined, yielding

(2.7.11)  $B_3 = C_{xx}^{-1} C_{xy} (G'^{-1})_t \, {}_t(G')$ .

It is shown in Appendix E that the squared multiple correlation of the jth criterion variable $y_j$ is

(2.7.12)  [equation not legible in the source]

which is identical in form to (2.5.2) and (2.6.14) except that there is one such equation for each criterion variable $y_j$. As was the case with regression on the principal components, the multiple correlation $r_{3j}$ is not the multiple correlation of $y_j$ with x but is the multiple correlation of $y_j$ and the t largest principal predictors. The correlation $r_{3j}$ is always less than or equal to the multiple correlation of $y_j$ with x.

The subscript 3 is dropped in later chapters.

Multiple regression on the principal predictors, as on the principal components, is a reduced rank method. The use of the t largest principal predictors instead of all m principal predictors is equivalent to approximating the matrix $C_{\hat{y}\hat{y}}$ by the matrix

(2.7.13)  $C_{\hat{y}_3\hat{y}_3} = (G)_t \, {}_t(G')$

The matrix $C_{\hat{y}_3\hat{y}_3}$ is of reduced rank t < m.

It was pointed out after equation (2.7.4) that the eigenvalues of $C_{\hat{y}\hat{y}}$ are estimates of the population parameters ($D_{kk}^2$, k = 1, ..., m). In order to estimate $\pi^2$ and ($q_k^2$, k = 1, ..., m) it is natural to require that the covariance matrices be changed to correlation matrices.


The correlations of the predictors x and the principal predictors w are given by the (n x m) sample matrix $S_1 = C_{xw}$. $S_1$ may be written as

(2.7.14)  $S_1 = C_{xw} = C_{xx} C_{xx}^{-1} C_{xy} (G')^{-1} = C_{xy} (G')^{-1}$

by equation (2.7.7).

The sample quantity $q_k^2$ is the average dependence of the predictors on the kth principal predictor and is therefore, since the predictors have unit variance by the use of correlation matrices, given by

(2.7.15)  sample $q_k^2 = \sum_{i=1}^{n} S_{ik}^2$ .

Similarly the sample estimate of $\pi^2$, the average criterion-related predictor variance, is the sum of the squares of all the elements of $S_1$ divided by n and is

(2.7.16)  sample $\pi^2 = (1/n) \sum_{k=1}^{m} \sum_{i=1}^{n} S_{ik}^2 = (1/n) \sum_{k=1}^{m}$ sample $q_k^2$
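As an illustration of (2.7.14) to (2.7.16), the sketch below computes the loading matrix and the two summary quantities from correlation matrices. It is an assumption-laden example rather than the original program: it uses NumPy, hypothetical random data, and takes the sample $q_k^2$ to be the column sum of squared loadings so that the final equality in (2.7.16) holds. The name `loading_summary` is illustrative.

```python
import numpy as np

def loading_summary(R_xx, R_yx):
    """Return S1 (n x m), the per-component sums q2 (m,), and pi2."""
    R_xx_inv = np.linalg.inv(R_xx)
    eigvals, V = np.linalg.eigh(R_yx @ R_xx_inv @ R_yx.T)
    order = np.argsort(eigvals)[::-1]             # decreasing eigenvalues
    G = V[:, order] * np.sqrt(eigvals[order])     # pattern matrix, G G' = R_yx R_xx^-1 R_xy
    S1 = R_yx.T @ np.linalg.inv(G.T)              # (2.7.14): C_xy (G')^-1
    q2 = (S1 ** 2).sum(axis=0)                    # squared loadings summed over predictors
    pi2 = q2.sum() / S1.shape[0]                  # (2.7.16): divide by n
    return S1, q2, pi2

# Hypothetical data: 5 predictors and 2 criteria, 300 subjects.
rng = np.random.default_rng(0)
x = rng.standard_normal((300, 5))
y = x @ rng.standard_normal((5, 2)) + rng.standard_normal((300, 2))
R = np.corrcoef(np.column_stack([x, y]), rowvar=False)
S1, q2, pi2 = loading_summary(R[:5, :5], R[5:, :5])
```

Because the predictors are standardized, `pi2` lies between 0 and 1 and can be read as the average proportion of predictor variance related to the criteria.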


2.8.  Cross-Validities 

The calculation of correlations in the validation sample is identical for all three weight computational methods. Given the weight vector b from the derivation sample, the square of the correlation between $b'x$ and a criterion variable y in the validation sample is

(2.8.1)  $r_c^2 = \dfrac{(b' c_{xy})^2}{(b' C_{xx} b) \, c_{yy}}$

where $c_{xy}$, $C_{xx}$ and $c_{yy}$ are covariances computed in the validation sample. The proof of this formula is given in Appendix E. The quantity $r_c$ is called the sample cross-validity and may be negative or positive. The sign of $r_c$ is equal to the sign of $b' c_{xy}$. The sign of $r_c$ is also affixed to $r_c^2$ when averages of several $r_c^2$ are taken.
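The cross-validity formula (2.8.1) and its sign convention can be sketched as follows. This is an illustrative example assuming NumPy and hypothetical simulated samples; the function name `signed_squared_cross_validity` is not from the source.

```python
import numpy as np

def signed_squared_cross_validity(b, c_xy, C_xx, c_yy):
    """(2.8.1) with the sign of b'c_xy affixed to the squared value."""
    num = float(b @ c_xy)
    r2 = num ** 2 / (float(b @ C_xx @ b) * c_yy)
    return np.sign(num) * r2

rng = np.random.default_rng(1)

def draw_sample(N):
    # Hypothetical population: 3 predictors, one criterion.
    x = rng.standard_normal((N, 3))
    y = x @ np.array([0.5, -0.3, 0.2]) + rng.standard_normal(N)
    return x, y

x1, y1 = draw_sample(50)            # derivation sample
x2, y2 = draw_sample(50)            # validation sample
C1 = np.cov(np.column_stack([x1, y1]), rowvar=False)
b = np.linalg.solve(C1[:3, :3], C1[:3, 3])   # regression weights from the derivation sample
C2 = np.cov(np.column_stack([x2, y2]), rowvar=False)
rc2 = signed_squared_cross_validity(b, C2[:3, 3], C2[:3, :3], C2[3, 3])
r_direct = np.corrcoef(x2 @ b, y2)[0, 1]     # same quantity computed from raw scores
```

The value from the covariance formula agrees with the squared correlation computed directly from the validation scores, which is a convenient check on an implementation.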


The predictor weight vector b, calculated in the derivation sample, may also be applied to the population itself. The square of the correlation between $b'x$ and y in the population is denoted $\rho_c^2$, and $\rho_c$ is called the population cross-validity. A sign is affixed to $\rho_c^2$ in the same way as to $r_c^2$.

The three statistics, $r^2$ (in its several forms), $r_c^2$, and $\rho_c^2$ are called the correlation statistics. These statistics are the principal quantities computed in the experiments described in Chapters 3 and 4.
CHAPTER 3

SIMULATION  RESULTS  WITH  ONE  CRITERION  VARIABLE 

A  number  of  experiments  were  performed  using  a  single 
criterion  variable.  The  two  methods  of  prediction  used 
were  multiple  regression  on  the  predictors  and  on  the 
principal  components  of  the  predictors.  The  model  and 
sample  generation  methods  described  in  Sections  2.2  and 
2.3  are  applicable  to  the  single  criterion  case  (m  =  1). 
However, in order to generate many samples without excessive use of computer time, a special procedure for the m = 1 case was developed. This procedure, described in detail in Appendix D, allows any number of criterion variables to be generated, each with the same population multiple correlation and each with the same relation to the predictors. The criterion variables are thus all duplicates of the single criterion of interest. The number of such duplicates is denoted by $m_d$.

As  an  example,  suppose  that  there  are  five  predictors 
and  ten  duplicate  criteria.  Then,  in  a  sample,  one  can 
compute  ten  multiple  correlations,  one  for  each  criterion. 
These  ten  correlations  are  all  based  on  one  sample  (size  N) 
of  five  predictor  scores  but  on  ten  different  samples  of 
single  criterion  scores. 

Each calculation described in this chapter is based on one or more populations ($\Sigma$) generated for each combination of the input parameters. For each $\Sigma$ that was generated, sample covariance matrices were generated in pairs ($C_1$, $C_2$), representing the two samples needed for double cross-validation. $\Sigma$, $C_1$, and $C_2$ are the covariance matrices of (n + $m_d$) variables: n predictors and $m_d$ duplicate criteria. Except in Section 3.4, at least two such pairs of sample covariance matrices were generated. This allowed variation in the predictor sample covariance matrices.

The  results  using  the  simulation  program  described 
in  this  chapter  are: 

(a)  When $\rho$ = 0, the sample multiple correlation follows the known theoretical law. (Section 3.1)

(b)  When $\rho \neq 0$, variation in $\Sigma_{xx}$ produced by change in the number of columns of S does not affect the correlation statistics $r^2$, $r_c^2$, and $\rho_c^2$. (Section 3.2)

(c)  The correlation statistics depend on n, N, and $\rho^2$ in theoretically understandable ways. Tables are presented which may be used to interpret sample multiple correlations and cross-validities. (Section 3.3)

(d)  The optimum number of principal components to include in the regression function depends on the parameters n, N, $\rho^2$, and $\pi^2$. (Section 3.4)

3.1.  Distribution of the Correlation Statistics When $\rho$ = 0

It is useful to study populations in which the multiple correlation ($\rho$) is zero even though such populations are of little practical significance. Firstly, the distributions of the correlation statistics when $\rho$ = 0 provide a baseline against which to compare the distributions obtained for non-zero $\rho$. Secondly, some properties of the distributions for $\rho$ = 0 are known theoretically and a comparison of the distributions obtained from the computer model for $\rho$ = 0 with the theoretical predictions provides a check of the model and the computer calculations.

Fisher (1928) showed that, if r is the sample multiple correlation,

(3.1.1)  $\dfrac{r^2}{1 - r^2} \cdot \dfrac{N - n - 1}{n}$

is distributed as F(n, N - n - 1) when $\rho$ = 0. This distribution has the important property that it is independent of the covariance matrix of the predictors, $\Sigma_{xx}$. The distribution of the sample cross-validity, $r_c$, is not known but its expected value is zero and its distribution is symmetric. Presumably its distribution is also independent of $\Sigma_{xx}$. Finally, the population cross-validity, $\rho_c$, is exactly zero since $\sigma_{xy}$ = 0.

xy 

An effective simulation of multiple regression should be able to reproduce these properties. The distributions of $r^2$ and $r_c^2$ were studied for two different sample sizes, N, and for predictor covariance matrices, $\Sigma_{xx}$, varying in two ways, namely in the values of the parameters $\pi^2$ and $n_s$. $\pi^2$ is the average of the variances of the part of each predictor dependent on the principal predictor; $n_s$ is the number of columns of the S matrix. Two values of $n_s$ were used: $n_s$ = 10 implies a square S matrix since n = 10, and $n_s$ = 20 implies a non-square S matrix.

The  input  parameters  which  were  constant  for  all 


the  calculations  in  this  section  are  shown  in  Table  1. 
Table 2 shows the variable model parameters and sample sizes and also the total number of populations and samples calculated for each model. According to Table 2 four different $\Sigma$s were generated for each model, and for each $\Sigma$ generated, three double cross-validations were performed. Since ten duplicate criteria were used throughout ($m_d$ = 10), there were altogether 10 x 4 x 3 x 2 = 240 sample multiple correlations and cross-validities computed for each model. The final factor of two in the preceding expression represents the two samples ($C_1$, $C_2$) which were generated for each double cross-validation.

In order to test whether $r^2$ satisfies the Fisher distribution law (3.1.1), it is necessary to have a tabulation of F(10, 39) for Models 1 and 2 and F(10, 120) for Models 3 and 4. These distributions were obtained directly, or by interpolation, from Owen (1962). From (3.1.1), $r^2$ is distributed as

(3.1.2)  $\dfrac{n \, F(n, N - n - 1)}{(N - n - 1) + n \, F(n, N - n - 1)}$

The percentage points used (those available in Owen) and the corresponding percentile points of the F and $r^2$ distributions are shown in Tables 3 and 4.
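The conversion in (3.1.2) from a percentile of F to the corresponding percentile of $r^2$ is a one-line calculation. The sketch below is illustrative (the function name `r2_from_f` is an assumption); the F values used are those listed in Tables 3 and 4.

```python
def r2_from_f(f, n, N):
    """Percentile of r^2 corresponding to a percentile f of F(n, N - n - 1), per (3.1.2)."""
    return n * f / ((N - n - 1) + n * f)

# n = 10 predictors; N = 50 gives F(10, 39), N = 131 gives F(10, 120).
r2_975_small = r2_from_f(2.401, 10, 50)    # .975 point for Models 1 and 2
r2_50_small = r2_from_f(0.951, 10, 50)     # .50 point for Models 1 and 2
r2_975_large = r2_from_f(2.157, 10, 131)   # .975 point for Models 3 and 4
```

Rounded to the precision of the tables, these reproduce the tabulated $r^2$ entries .381, .196, and .1524.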

The four $\Sigma$s and associated samples for each model were divided equally into two sets, chosen in the order that they were computed. The cumulative frequency distributions of the 120 sample squared multiple correlations, $r^2$, are presented in Tables 3 and 4. Each set of 120


Table 1

Constant Parameters for Section 3.1

n = 10
$m_d$ = 10 (m = 1)
$\rho^2$ = 0.0
$v_x$ = 0.01
$e_x$ = 0.0
$v_y$, $e_y$ inapplicable since m = 1


Table 2

Variable Parameters for Section 3.1

                                            Model 1   Model 2   Model 3   Model 4
$n_s$                                          10        20        10        20
$\pi^2$                                        .2        .2        .5        .5
number of $\Sigma$s                             4         4         4         4
N                                              50        50       131       131
number of $C_1$, $C_2$ pairs per $\Sigma$       3         3         3         3

Table 3

Distribution of $r^2$ for Models 1 and 2

                                            Cumulative Frequencies
                                                 Model 1         Model 2
Probability   F(10, 39)   $r^2$     Expected   Set 1   Set 2   Set 1   Set 2
  1.000                   1.000       120       120     120     120     120
   .975         2.401      .381       117       117     116     117     117
   .95          2.086      .348       114       115     114     115     114
   .90          1.769      .312       108       103     107     108     112
   .75          1.329      .254        90        85      80      98      91
   .50           .951      .196        60        50      52      60      59
   .25           .664      .145        30        25      23      30      26
   .10           .469      .107        12        16      12      11       9
   .05           .375      .0878        6         5       6       5       6
   .025          .307      .0729        3         1       3       4       4
   .000          .000      .0000        0         0       0       0       0

MD = maximum absolute difference
     (observed - expected)                       10      10       8       4
Kolmogorov-Smirnov D = MD/120                  .083    .083    .067    .033

Table 4

Distribution of $r^2$ for Models 3 and 4

                                            Cumulative Frequencies
                                                 Model 3         Model 4
Probability   F(10, 120)  $r^2$     Expected   Set 1   Set 2   Set 1   Set 2
  1.000                   1.0000      120       120     120     120     120
   .975         2.157      .1524      117       120     115     119     117
   .95          1.911      .1373      114       119     111     113     113
   .90          1.652      .1210      108       114     107     106     109
   .75          1.279      .0963       90        95      96      88      95
   .50           .939      .0726       60        65      69      60      69
   .25           .670      .0529       30        27      39      33      32
   .10           .480      .0385       12         7      18      13      13
   .05           .388      .0313        6         6       9       7       5
   .025          .318      .0259        3         4       2       5       3
   .000          .000      .0000        0         0       0       0       0

MD = maximum absolute difference
     (observed - expected)                        6       9       3       9
Kolmogorov-Smirnov D = MD/120                  .050    .075    .025    .075


squared multiple correlations originated in six sample covariance matrices for each of two $\Sigma$s. There were ten duplicate criteria in each sample.

The observed frequency distributions were compared with the expected distributions by the Kolmogorov-Smirnov one sample test (Siegel, 1956). MD is the maximum absolute difference between observed and expected cumulative frequencies and D = MD/120 is the Kolmogorov-Smirnov statistic. Both values are presented in the tables. The critical D for the one sample, two tailed test is 0.12 ($\alpha$ = 0.05, N = 120). None of the Ds in Tables 3 and 4 exceeds this value. There is therefore good evidence that the multiple correlations generated by the simulation program satisfy the Fisher law.
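The Kolmogorov-Smirnov calculation can be reproduced from the cumulative frequencies. The sketch below is illustrative only: the frequencies are those of Table 3 (Model 1, Set 1), and the critical value uses the usual large-sample approximation 1.36 / sqrt(N) for $\alpha$ = 0.05, which is an assumption about how the value 0.12 was obtained. Function names are hypothetical.

```python
import math

def ks_one_sample(observed, expected, total):
    """Return (MD, D) from observed and expected cumulative frequencies."""
    md = max(abs(o - e) for o, e in zip(observed, expected))
    return md, md / total

def ks_critical_one_sample(total, coefficient=1.36):
    """Large-sample two-tailed critical D; 1.36 corresponds to alpha = 0.05."""
    return coefficient / math.sqrt(total)

expected = [120, 117, 114, 108, 90, 60, 30, 12, 6, 3, 0]
observed = [120, 117, 115, 103, 85, 50, 25, 16, 5, 1, 0]   # Table 3, Model 1, Set 1

md, d = ks_one_sample(observed, expected, 120)
critical = ks_critical_one_sample(120)
fits_fisher_law = d < critical
```

With these inputs MD is 10 and D is about .083, below the critical value of about 0.12, in agreement with the table.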

The means and standard deviations of the eight sets of cross-validities (120 in each set) are presented in Table 5. The t values are also shown. The critical t for a two tailed test is 1.98 ($\alpha$ = 0.05, df = 119). The cross-validities do not have mean zero by this test since two of the ts exceed the critical value and one other is almost at the critical value. If the models with the same sample size are combined the ts are +1.38 (Models 1 and 2) and -1.49 (Models 3 and 4). The critical t is now 1.96 (df = 479). It is apparent that the powerful t test is able to show small imperfections in the simulation procedure.
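The t values can be checked from the tabulated means and standard deviations. This is a minimal sketch assuming the ordinary one-sample t statistic with the standard error taken as SD / sqrt(120); because the tabulated inputs are rounded, the results agree with the reported ts only to within about 0.02. The function name is illustrative.

```python
import math

def one_sample_t(mean, sd, count):
    """t statistic for testing that the population mean is zero."""
    return mean / (sd / math.sqrt(count))

# Mean and standard deviation of the signed squared cross-validities (Table 5).
t_model1_set2 = one_sample_t(0.00439, 0.0222, 120)    # reported as 2.16
t_model3_set1 = one_sample_t(-0.00232, 0.0095, 120)   # reported as -2.66
exceeds_critical = [abs(t) > 1.98 for t in (t_model1_set2, t_model3_set1)]
```

Both statistics exceed the critical value of 1.98, matching the two starred entries in Table 5.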

The  calculations  in  this  section  have  shown  that  when 
the  population  multiple  correlation  is  zero,  the  sample 
statistics  obey,  at  least  to  a  certain  extent,  the 
Table 5

Cross-Validities for Models 1 to 4

Model   Set   Mean $r_c^2$   Standard Deviation      t
  1      1       .00168           .0402             .46
  1      2       .00439           .0222            2.16*
  2      1       .00232           .0405             .63
  2      2       .00068           .0378             .20
  3      1      -.00232           .0095           -2.66*
  3      2       .00228           .0129            1.94
  4      1      -.00146           .0147           -1.08
  4      2      -.00213           .0149           -1.56

Note: Each $r_c^2$ entering into the mean has the same sign as the corresponding $r_c$.
* p < .05

theoretically known distributions. The function (3.1.1) of the sample multiple correlation has an F distribution and the mean value of the cross-validity is approximately zero.


3.2.  Dependence of the Correlation Statistics on $\Sigma_{xx}$ When $\rho \neq 0$

It was shown in Section 2.2 and Appendix A that the average squared population multiple correlation, $\rho^2$, depends only on the ($D_{kk}^2$, k = 1, ..., m) (equation 2.2.11) and not on matrix S. In particular, when there is only one criterion (m = 1), the squared population multiple correlation of the criterion with the predictors is

(3.2.1)  $\rho^2 = D_{11}^2$

Since $D_{11}^2$ is an input parameter it is therefore straightforward to specify an arbitrary $\rho^2$ for a desired population.

Unfortunately this simple specification of $\rho^2$ is only possible when the matrix S is square. When S is non-square, say with $n_s$ columns, then $\rho^2$ is given by

(3.2.2)  $\rho^2 = D_{11}^2 \, s_1' (S S')^{-1} s_1$

which depends on the matrix S. The vector $s_1$ is the first column of S (equation 2.2.13). Since S is generated to some extent randomly by the population generation procedure, it is impossible to specify by input parameters what the population multiple correlation will be. This is a severe limitation on the model if S is not square.
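The difference between the square and non-square cases in (3.2.1) and (3.2.2) can be seen numerically. The sketch below assumes NumPy, treats S as an n x $n_s$ matrix, and uses hypothetical random S matrices rather than those produced by the population generation procedure. The function name is illustrative.

```python
import numpy as np

def population_rho_squared(S, d11_squared):
    """Evaluate (3.2.2); s1 is the first column of S."""
    s1 = S[:, 0]
    return d11_squared * float(s1 @ np.linalg.solve(S @ S.T, s1))

rng = np.random.default_rng(0)
square_S = rng.standard_normal((5, 5))    # n = 5, n_s = 5
wide_S = rng.standard_normal((5, 10))     # n = 5, n_s = 10

rho2_square = population_rho_squared(square_S, 0.5)
rho2_wide = population_rho_squared(wide_S, 0.5)
```

For an invertible square S the quadratic form $s_1' (S S')^{-1} s_1$ equals one, so (3.2.2) reduces to (3.2.1). For a non-square S the quadratic form is a diagonal element of a projection matrix and is generally less than one, so $\rho^2$ falls below the input $D_{11}^2$ by an amount that depends on S.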

2 

Because of the ease in specifying $\rho^2$, the models in the remaining sections of this study all employ square S matrices. In order to show, at least to a certain extent, that this does not affect the generality of the conclusions, some experiments are described in this section which compare the correlation statistics of square S and non-square S models. A further comparison is made of two models which differ only in the parameter $\pi^2$.

The  input  parameters  which  were  constant  for  all 
the  calculations  in  this  section  are  shown  in  Table  6. 

Table 6

Constant Parameters for Section 3.2

n = 5
$m_d$ = 10 (m = 1)
$v_x$ = 0.01
$e_x$ = 0.00
$v_y$, $e_y$ inapplicable since m = 1
N = 40


Table 7 shows the variable parameters and the total number of populations and samples calculated for each of the ten models. Models 5 and 6 are square S models differing only in $\pi^2$. Two populations were generated for each model and three sample pairs for each population. This results in 10 x 2 x 3 x 2 = 120 sample correlations for each model.

Each of the four remaining model pairs differs only in $n_s$; one model of each pair has square S and the other model has non-square S. The squared population multiple correlation, $\rho^2$, is identical for the models in each pair. The identity holds to six or seven decimal places even though only three are shown in Table 7. The identity was produced in the following way. As explained above, $\rho^2$ is not an input parameter. The input $D_{11}^2$ is equal to $\rho^2$ only for square S models. The non-square S Models 8, 10, 12, and 14 were generated using $D_{11}^2$ = 0.5. The population squared multiple correlation, $\rho^2$, was calculated for each model by (3.2.2). These are the values in Table 7. These computed values were used as the input $D_{11}^2$ for the square S Models 7, 9, 11, and 13. Since only one $\Sigma$ could be generated for each set of parameters, six $C_1$, $C_2$ pairs were generated instead of three. This means that 10 x 1 x 6 x 2 = 120 sample correlations were generated, the same number as for Models 5 and 6.

The frequency distributions of the 120 correlations for each model are presented in Tables 8 ($r^2$), 9 ($r_c^2$), and 10 ($\rho_c^2$). The maximum absolute difference, MD, and the Kolmogorov-Smirnov statistic D are also presented for each pair of models. The critical D for the two sample test, two tailed, is 0.18 ($\alpha$ = 0.05, N = 120). None of the sample values exceeds this value although two approach it. It can be safely stated that for these examples the variation in $\Sigma_{xx}$ has not produced differences in the observed distributions of the correlation statistics. This conclusion is further confirmed by the comparison of the means of the pairs of distributions as shown in Table 11. The critical t value for a two tailed test is 1.98 ($\alpha$ = 0.05, df = 119) and none of the sample ts exceeds this value.


Table 8

Cumulative Frequency Distribution of $r^2$ for Models 5 to 14


Model 


2 

r 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

.85 

120 

120 

120 

.80 

120 

120 

120 

119 

119 

119 

.75 

119 

118 

119 

119 

117 

115 

.70 

108 

111 

120 

120 

115 

119 

113 

109 

.65 

101 

93 

120 

119 

119 

102 

114 

94 

98 

.60 

76 

68 

119 

120 

116 

115 

90 

97 

68 

71 

.55 

44 

48 

118 

117 

111 

111 

73 

73 

49 

51 

.50 

31 

29 

115 

113 

99 

96 

48 

50 

27 

29 

.45 

18 

17 

104 

105 

69 

77 

30 

30 

13 

19 

57 


Table 9

Cumulative Frequency Distribution of $r_c^2$ for Models 5 to 14


Model 


'I 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

.75 

120 

120 

120 

.70 

120 

119 

119 

117 

.65 

119 

115 

120 

120 

120 

115 

115 

.60 

112 

105 

119 

112 

114 

110 

106 

.55 

102 

83 

120 

119 

120 

99 

168 

91 

93 

.50 

72 

68 

120 

119 

116 

118 

85 

94 

66 

71 

.45 

55 

56 

118 

118 

113 

110 

74 

82 

5i 

47 

.40 

38 

29 

112 

113 

104 

100 

59 

6l 

26 

32 

.35 

27 

18 

106 

111 

88 

86 

43 

42 

13 

24 

•  30 

14 

10 

99 

102 

64 

68 

25 

26 

8 

15 

.25 

8 

4 

88 

98 

44 

53 

13 

17 

4 

8 

.20 

2 

2 

66 

81 

22 

35 

7 

7 

2 

3 

.15 

0 

0 

54 

52 

12 

25 

5 

3 

0 

0 

.10 

31 

36 

7 

18 

3 

0 

.05 

15 

16 

3 

6 

0 

.00 

2 

1 

C 

2 

.05 

• 

0 

1 

0 

.10 

0 

MD 

19 

15 

13 

9 

11 

D 

.158 

.125 

.109 

• 

075 

• 

092 
Table 10

Cumulative Frequency Distribution of $\rho_c^2$ for Models 5 to 14


Model 


2 

pc 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

•  500 

120 

120 

120 

120 

.475 

82 

93 

94 

99 

.450 

53 

56 

120 

120 

59 

61 

.425 

31 

33 

113 

107 

29 

26 

.400 

18 

16 

78 

71 

18 

6 

.375 

13 

7 

47 

37 

9 

3 

•  350 

8 

3 

28 

21 

4 

1 

.325 

6 

2 

120 

120 

14 

11 

2 

0 

.300 

3 

0 

108 

114 

9 

6 

2 

.275 

1 

82 

81 

4 

3 

0 

.250 

1 

120 

120 

57 

52 

1 

1 

.225 

1 

118 

113 

24 

32 

1 

0 

.200 

1 

88 

91 

14 

14 

1 

.175 

1 

65 

65 

5 

10 

0 

.150 

1 

37 

45 

2 

5 

.125 

0 

18 

32 

2 

3 

.100 

9 

15 

1 

3 

.075 

4 

10 

0 

1 

.050 

3 

4 

1 

.025 

2 

4 

1 

.000 

0 

1 

0 

.025 

0 

MD 

11 

14 

8 

10 

12 

D 

.092 

.117 

# 

067 

.083 

.100 
In the simulation model, the population multiple correlation is independent of $\pi^2$, the average predictor variance related to the criteria. The comparison of Models 5 and 6 confirmed that the sample statistics are independent of $\pi^2$. Even though the population multiple correlation does depend on the matrix S if it is not square, it was shown that the correlation statistics are not affected by the matrix S for the models considered in this section. All further calculations in this study involve square S models only.

3.3.  Dependence of the Correlation Statistics on n, N, and $\rho^2$

The cross-validation technique was developed as a way to correct a sample multiple correlation (Mosier, 1951). The purpose of the calculations described in this section was to investigate empirically the relationship of the sample multiple correlation and the cross-validities to the population multiple correlation.

The parameters which were constant for all the calculations in this section are shown in Table 12. Table 13 indicates the values of the three parameters which were varied. All possible combinations of squared multiple correlations, $\rho^2$, number of predictors, n, and sample size, N, were used except for those combinations with (n, N) = (15, 16). A different model was generated for each combination of ($\rho^2$, n, N) so that the models for combinations differing only in N are different models.
Table 12

Constant Parameters for Section 3.3

m = 1
$n_s$ = n (square S matrix)
$v_x$ = 0.01
$e_x$ = 0.0
$v_y$, $e_y$ inapplicable since m = 1


Table 13

Variable Parameters for Section 3.3

n = 2, 5, 10, 15
$\rho^2$ = 0.0, 0.1, 0.25, 0.5, 0.75
N = 16, 26, 50, 131

Models with (n, N)                           (15, 16)    (10, 16)    all others
$m_d$                                        not done        5           10
number of $\Sigma$s                                          1            1
number of $C_1$, $C_2$ pairs per $\Sigma$                    4            2

In all cases 40 sample correlations were obtained. The means of the 40 correlations of each type ($r^2$, $r_c^2$, and $\rho_c^2$) are shown in Tables 14 to 17. The expected values $E(r^2)$ as calculated from (1.2.10) are also shown in these tables. The standard errors of the correlation means range from 0.01 to 0.02 for $r^2$ and $r_c^2$ and somewhat less for $\rho_c^2$. However, this standard error does not take into account variation which would be produced by another population generated from the same parameters. Table 13 indicates that only one $\Sigma$ was generated for each parameter combination. Nevertheless the tables give a useful picture of the dependence of the three correlation statistics on $\rho^2$, N, and n.

The following observations may be made from the tables:

(a)  The squared sample multiple correlation, $r^2$, is an overestimate of $\rho^2$. The expected values of $r^2$, $E(r^2)$, from (1.2.10) match the observed $r^2$ values very well. The first two terms of the expansion (1.2.11) may be rearranged to show that the bias in $E(r^2)$ is a simple function of n, N, and $\rho^2$:

(3.3.1)  $E(r^2) - \rho^2 \approx \dfrac{n}{N - 1} (1 - \rho^2)$

The match of $E(r^2)$ and $r^2$ shows that formulas (1.2.13) and (1.2.7), which are essentially backward solutions of (1.2.10), provide reasonable estimates of the squared population multiple correlation.

(b)  The squared sample cross-validity, $r_c^2$, is generally an underestimate of $\rho^2$ and this bias (except for
Table 14

Correlation Statistics for Sample Size N = 16

         $\rho^2$       .00     .10     .25     .50     .75

n = 2    $E(r^2)$      .133    .211    .331    .541    .763
         $r^2$         .160    .189    .329    .593    .805
         $r_c^2$       .029    .121    .217    .534    .763
         $\rho_c^2$    .000    .056    .191    .468    .731

n = 5    $E(r^2)$      .333    .393    .485    .647    .818
         $r^2$         .307    .409    .450    .669    .866
         $r_c^2$       .014    .086    .104    .372    .691
         $\rho_c^2$    .000    .027    .107    .351    .648

n = 10   $E(r^2)$      .667    .696    .743    .823    .909
         $r^2$         .680    .666    .744    .821    .908
         $r_c^2$      -.035    .003    .047    .300    .523
         $\rho_c^2$    .000    .009    .041    .217    .489
Table 15

Correlation Statistics for Sample Size N = 26

         $\rho^2$       .00     .10     .25     .50     .75

n = 2    $E(r^2)$      .080    .166    .297    .523    .757
         $r^2$         .073    .183    .372    .542    .805
         $r_c^2$       .016    .125    .317    .491    .788
         $\rho_c^2$    .000    .076    .226    .480    .738

n = 5    $E(r^2)$      .200    .275    .389    .585    .789
         $r^2$         .188    .277    .400    .565    .733
         $r_c^2$      -.013    .059    .204    .384    .642
         $\rho_c^2$    .000    .046    .161    .413    .710

n = 10   $E(r^2)$      .400    .456    .542    .689    .841
         $r^2$         .409    .455    .567    .685    .855
         $r_c^2$      -.010    .012    .145    .335    .642
         $\rho_c^2$    .000    .016    .102    .311    .629

n = 15   $E(r^2)$      .600    .637    .694    .792    .894
         $r^2$         .604    .648    .669    .804    .888
         $r_c^2$       .009    .030    .050    .252    .425
         $\rho_c^2$    .000    .012    .050    .199    .476

Table 16

Correlation Statistics for Sample Size N = 50

         $\rho^2$       .00     .10     .25     .50     .75

n = 2    $E(r^2)$      .041    .133    .274    .511    .753
         $r^2$         .037    .139    .276    .524    .738
         $r_c^2$       .002    .107    .239    .496    .733
         $\rho_c^2$    .000    .079    .233    .485    .746

n = 5    $E(r^2)$      .102    .189    .320    .542    .769
         $r^2$         .092    .182    .350    .547    .765
         $r_c^2$       .007    .079    .218    .468    .720
         $\rho_c^2$    .000    .057    .198    .453    .725

n = 10   $E(r^2)$      .204    .281    .397    .594    .795
         $r^2$         .202    .263    .399    .634    .790
         $r_c^2$       .005    .063    .137    .476    .694
         $\rho_c^2$    .000    .042    .136    .418    .693

n = 15   $E(r^2)$      .306    .373    .474    .646    .821
         $r^2$         .327    .353    .483    .659    .806
         $r_c^2$      -.003    .033    .160    .350    .666
         $\rho_c^2$    .000    .027    .116    .350    .671
Table 17

Correlation Statistics for Sample Size N = 131

         $\rho^2$       .00     .10     .25     .50     .75

n = 2    $E(r^2)$      .015    .113    .239    .504    .751
         $r^2$         .014    .118    .241    .526    .764
         $r_c^2$       .001    .101    .235    .515    .761
         $\rho_c^2$    .000    .091    .246    .496    .748

n = 5    $E(r^2)$      .038    .133    .276    .516    .757
         $r^2$         .040    .127    .299    .500    .769
         $r_c^2$      -.001    .075    .245    .465    .749
         $\rho_c^2$    .000    .076    .227    .483    .742

n = 10   $E(r^2)$      .077    .168    .305    .534    .767
         $r^2$         .085    .178    .295    .537    .767
         $r_c^2$       .002    .078    .209    .473    .730
         $\rho_c^2$    .000    .061    .205    .464    .730

n = 15   $E(r^2)$      .115    .203    .334    .554    .776
         $r^2$         .116    .202    .325    .573    .799
         $r_c^2$      -.000    .037    .174    .463    .753
         $\rho_c^2$    .000    .048    .181    .449    .722
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p 

p  =0.0)  tends  to  be  approximately  the  same  for 

2  2 
all  values  of  p  for  fixed  (n,  N).  As  with  r  ,  the 

bias  of  the  cross-validity  decreases  with  N  and  in- 

2 

creases  with  n  for  fixed  p  . 

p 

(c)  The  squared  population  cross-validity,  p  ,  is  the 
squared  cross-validity  using  the  derived  weights  on 

a  validation  sample  of  infinite  size.  The  tables 

2  2 

show  that  p  has  similar  values  to  r  and  the  same 
c  c 

2  2 

comments  apply  to  p  as  were  applied  to  r  above. 

c  c 

2 

Looked  at  another  way,  rc  is  an  unbiased  estimate 
of  p2. 
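The bias relation (3.3.1) in observation (a) can be sketched as a pair of small functions. This is an illustration under stated assumptions: the forward function is the first-order approximation only (the tabulated $E(r^2)$ values come from the exact expression (1.2.10), which is not reproduced here), and the backward function is simply the algebraic inverse of (3.3.1), not necessarily identical to formulas (1.2.13) or (1.2.7). Function names are hypothetical.

```python
def expected_r2_first_order(rho2, n, N):
    """First-order approximation to E(r^2) from (3.3.1)."""
    return rho2 + n / (N - 1) * (1 - rho2)

def rho2_from_r2(r2, n, N):
    """Solve (3.3.1) backward for rho^2, treating r^2 as E(r^2)."""
    k = n / (N - 1)
    return (r2 - k) / (1 - k)

# With rho^2 = 0 the approximation reduces to n / (N - 1).
e_n2_N16 = expected_r2_first_order(0.0, 2, 16)
e_n5_N26 = expected_r2_first_order(0.0, 5, 26)
e_n15_N131 = expected_r2_first_order(0.0, 15, 131)
recovered = rho2_from_r2(expected_r2_first_order(0.5, 10, 50), 10, 50)
```

For $\rho^2$ = 0 these values agree with the $E(r^2)$ entries .133, .200, and .115 in Tables 14, 15, and 17; for non-zero $\rho^2$ the first-order values are slightly higher than the tabulated ones.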

The tables confirm the known properties of multiple
regression and cross-validation as summarized in
(1.2.14). The sample multiple correlation is an overestimate
of the population value since the sample weights are
chosen to optimize the correlation in the derivation sample.
These weights are not the optimum weights in either the
population or another sample and the consequence is that
both ρ_c² and r_c² are biased low. With repeated samplings
the values of r_c² cluster around ρ_c² since some weights are
better for the validation sample than the population
(r_c² > ρ_c²) and other weights are worse (r_c² < ρ_c²).
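The direction of these two biases can be illustrated with a small simulation. The sketch below is not the thesis program: it assumes uncorrelated standard normal predictors, equal population weights chosen so that the squared multiple correlation equals rho2, and a signed square for negative cross-validities. The function name and default values are hypothetical.

```python
import numpy as np

def simulate_bias(n=10, N=50, rho2=0.25, reps=200, seed=0):
    """Return (mean r^2, mean signed r_c^2) over `reps` pairs of samples."""
    rng = np.random.default_rng(seed)
    # Assumed population: independent unit-variance predictors, equal weights.
    beta = np.full(n, np.sqrt(rho2 / n))   # beta'beta = rho2
    sigma = np.sqrt(1.0 - rho2)            # criterion variance is then 1
    r2_values, rc2_values = [], []
    for _ in range(reps):
        X1 = rng.standard_normal((N, n))
        y1 = X1 @ beta + sigma * rng.standard_normal(N)
        X2 = rng.standard_normal((N, n))
        y2 = X2 @ beta + sigma * rng.standard_normal(N)
        # Derive least-squares weights (with intercept) in sample 1.
        A1 = np.column_stack([np.ones(N), X1])
        w, *_ = np.linalg.lstsq(A1, y1, rcond=None)
        r = np.corrcoef(A1 @ w, y1)[0, 1]
        # Apply the same weights to sample 2 (cross-validation).
        A2 = np.column_stack([np.ones(N), X2])
        rc = np.corrcoef(A2 @ w, y2)[0, 1]
        r2_values.append(r ** 2)
        rc2_values.append(np.sign(rc) * rc ** 2)
    return float(np.mean(r2_values)), float(np.mean(rc2_values))

mean_r2, mean_rc2 = simulate_bias()
print(mean_r2, mean_rc2)
```

Under these assumptions the average r² should lie above the population value and the average r_c² below it, which is the pattern the tables show.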

These tables can, perhaps, be useful in estimating
the population ρ² from sample r² and r_c² values obtained
from real data. If the values of (n, N) correspond to
one of the tables and if r² and r_c² match the values in
the tables for a value of ρ², then this value of ρ² is
the estimate of the population multiple correlation.
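As a hedged illustration of such a table lookup, the sketch below stores the n = 5 rows of Table 17 (N = 131) and returns the tabled ρ² whose (r², r_c²) pair is closest to the observed pair. The nearest-match rule, the function name, and the dictionary layout are assumptions made for the example; the text asks only that the values "match".

```python
# (r^2, r_c^2) for n = 5, N = 131, keyed by the population rho^2 (Table 17).
TABLE_17_N5 = {
    0.00: (0.040, -0.001),
    0.10: (0.127, 0.075),
    0.25: (0.299, 0.245),
    0.50: (0.500, 0.465),
    0.75: (0.769, 0.749),
}

def estimate_rho2(r2, rc2, table=TABLE_17_N5):
    """Return the tabled rho^2 whose (r^2, r_c^2) entry is nearest the observed pair."""
    def squared_distance(rho2):
        tab_r2, tab_rc2 = table[rho2]
        return (tab_r2 - r2) ** 2 + (tab_rc2 - rc2) ** 2
    return min(table, key=squared_distance)

# Hypothetical observed values from a real-data study with n = 5, N = 131.
print(estimate_rho2(0.30, 0.24))
```

A finer estimate would need interpolation between tabled values of ρ², which the tables as given do not support directly.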

3.4.  Prediction  from  the  Principal  Components 

Burket (1964) showed that cross-validities can be
increased by using only a few principal components of
the predictors in the prediction function. The formulas
for such prediction were presented in Section 2.6. The
calculations described in the present section demonstrate
the improvement of prediction by using the largest principal
components in the simulation data and show how
variation in some parameters can change the effect.

Sixty models were generated varying in four parameters:
n, the number of predictors; ρ², the squared
multiple correlation; π², the average criterion-related
predictor variance; and N, the sample size. The constant
parameters are listed in Table 18. All combinations of
the variable parameters listed in Table 19 were used.
A new model was generated for each (n, ρ², π², N) combination
so that here, as in Section 3.3, combinations differing
only in N are different models.

For each simulation model two sample covariance
matrices, C₁ and C₂, were generated, both with the same
sample size N. The covariance matrix of the first sample
predictors, (C₁)_xx, was diagonalized and the weights for
predicting each of the ten duplicate criteria from the
largest principal component were calculated. These
weights were validated on C₂ and the cross-validities
r_c^(1) were calculated for each criterion, the superscript
(1) indicating that one component was included in the
Table 18

Constant Parameters for Section 3.4

= 10 (m = 1)
n_s = n (square S matrix)
v_x = 0.01
e_x = 0.0
v_y, e_y inapplicable since m = 1
number of Σs per model = 1
number of C₁, C₂ pairs per Σ = 1

Table 19

Variable Parameters for Section 3.4

n = 5, 10
ρ² = 0.25, 0.50, 0.75
π² = 0.20, 0.35, 0.50, 0.65, 0.80
N = 2 j, 100 for n = 5 and N = 25, 105 for n = 10




regression. The weights were then recomputed for prediction
from the two largest principal components resulting in
cross-validities r_c^(2). This procedure was continued until
all components had been included in the regression. In
general, the t largest principal components resulted in
ten cross-validities r_c^(t) for each t (t = 1, ..., n).
The average of the squares of the ten cross-validities
for each t was calculated. The largest of these averages
is called r_c²(max) and occurs for t = t_max. The symbol
t_max represents the number of components producing the
largest average squared cross-validity. (When r_c^(t) was
negative, a negative sign was affixed to its square before
the averages were calculated.)
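The procedure can be sketched in a few lines. The version below is an illustration only: it works from raw data rather than from sample covariance matrices, uses a single criterion instead of the average over ten duplicate criteria, and all function names are hypothetical. The signed square follows the convention just described.

```python
import numpy as np

def pc_cross_validities(X1, y1, X2, y2):
    """Signed squared cross-validity for regression on the t largest
    principal components of the sample-1 predictors, t = 1, ..., n."""
    mean_x, mean_y = X1.mean(axis=0), y1.mean()
    Xc, yc = X1 - mean_x, y1 - mean_y
    cov_xx = Xc.T @ Xc / (len(X1) - 1)
    eigval, eigvec = np.linalg.eigh(cov_xx)       # ascending eigenvalues
    eigvec = eigvec[:, np.argsort(eigval)[::-1]]  # largest component first
    results = []
    for t in range(1, X1.shape[1] + 1):
        V = eigvec[:, :t]
        # Regress the criterion on the scores of the t largest components.
        gamma, *_ = np.linalg.lstsq(Xc @ V, yc, rcond=None)
        w = V @ gamma                              # weights on the predictors
        r = np.corrcoef((X2 - mean_x) @ w, y2)[0, 1]
        results.append(np.sign(r) * r ** 2)
    return np.array(results)

def best_t(cross_validities):
    """t_max: number of components giving the largest squared cross-validity."""
    return int(np.argmax(cross_validities)) + 1

# Hypothetical data: five predictors, forty cases per sample.
rng = np.random.default_rng(1)
beta = np.array([0.5, 0.3, 0.0, 0.0, 0.2])
X1 = rng.standard_normal((40, 5)); y1 = X1 @ beta + rng.standard_normal(40)
X2 = rng.standard_normal((40, 5)); y2 = X2 @ beta + rng.standard_normal(40)
cv = pc_cross_validities(X1, y1, X2, y2)
print(cv, best_t(cv))
```

With all n components included, the weights coincide with ordinary least-squares weights, so the last entry equals the ordinary squared cross-validity.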

The above procedure was repeated for validating the
principal components of (C₂)_xx on C₁. The quantities
r_c²(max) and t_max were again calculated. As a result,
there are two values of r_c²(max) and t_max for each model
generated; these values are listed in Appendix I.

In Section 3.3 it was shown that the squared cross-validity
r_c² (when all variables or principal components
are included in the regression) underestimates the squared
population multiple correlation ρ². The result was confirmed
with the 60 new models since only 14 out of the
120 values of [r_c^(n)]² exceeded ρ². Recall that r_c^(n) is
the same as r_c. The r_c²(max) were less biased, as 30 out of
120 values exceeded ρ². Thus r_c²(max) is still an underestimate
of the squared population multiple correlation.
Averaging the squared correlations before calculating
t_max has the disadvantage of reducing r_c²(max) from what it
would be if the maximum r_c² was found for each criterion and
then these maxima averaged. These maxima will occur for
different t for the different criteria, in general.
In several cases which were examined, however, it was found
that most maxima occurred for the same t. Since some
averaging had to be done to comprehend the results, the
method previously described was used for simplicity.

There are a number of ways in which the results in
Appendix I may be summarized. Table 20 shows how t_max
varies as a function of ρ², π², and N for each of the two
values of n. For example, the first section of Table 20
shows that for all ten models with ρ² = 0.25 and n = 5
(two values of t_max per model), t_max = 1 occurred five
times, t_max = 2 occurred four times, etc. This section of
the table shows that t_max tended to increase as ρ² increased.
This increase is reflected in the correlation between ρ²
and t_max of 0.157.

A summary of the correlations of t_max with the parameters
is given in Table 21. The last column in the table shows
the correlations when the data for n = 5 and n = 10 are
combined. Before these correlations could be computed
the values of t_max for n = 10 (which range from 1 to 10)
had to be converted to the same range as the values of
t_max for n = 5 (which range from 1 to 5). This was done
by recoding t_max = 1 or 2 as 1, t_max = 3 or 4 as 2, etc.,
for n = 10. The first three correlations in the last

Frequencies  of  tmax  for  Cross-Validities 
Table 21

Correlations of t_max (for r_c²) and Parameters

Correlation of
t_max and             n = 5    n = 10    All n

ρ²                     .157     .467      .302
π²                    -.593    -.497     -.524
N                      .346     .394      .361
n                                        -.251

number of samples       60       60       120

The correlations in the last column were computed
after the values of t_max for n = 10 were recoded in pairs
as values from 1 to 5.


Table 22

Correlations of t_max (for ρ_c²) and Parameters

Correlation of
t_max and             n = 5    n = 10    All n

ρ²                     .217     .325      .261
π²                    -.679    -.566     [illegible]
N                      .364     .514      .434
n                                        -.186

number of samples       60       60       120

The correlations in the last column were computed
after the values of t_max for n = 10 were recoded in pairs
as values from 1 to 5.
column are, in effect, an average of the correlations in
the first two columns. The last correlation in the column
shows the correlation of n and t_max.
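The recoding step amounts to mapping each pair of adjacent values onto one value. A minimal sketch follows; the function names are hypothetical, and the t_max and N values in the usage lines are made up for illustration and are not taken from Appendix I.

```python
import numpy as np

def recode_pairs(t_max):
    """Recode t_max from the range 1..10 to 1..5: 1 or 2 -> 1, 3 or 4 -> 2, etc."""
    return (np.asarray(t_max) + 1) // 2

def pooled_correlation(parameter, t_max_n5, t_max_n10):
    """Correlate a parameter with t_max after putting both n on the 1..5 scale.
    `parameter` lists the values for the n = 5 cases followed by the n = 10 cases."""
    t_all = np.concatenate([np.asarray(t_max_n5), recode_pairs(t_max_n10)])
    return float(np.corrcoef(np.asarray(parameter, dtype=float), t_all)[0, 1])

print(recode_pairs([1, 2, 3, 4, 9, 10]))
# Hypothetical values of N and t_max for three n = 5 and three n = 10 cases.
print(pooled_correlation([25, 100, 100, 25, 105, 105], [1, 3, 4], [2, 7, 9]))
```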

The size of the correlations shows that each of the
parameters ρ², π², N, and n has a substantial main effect
on t_max. These effects may be summarized as follows:

(a) An increase in the number of predictors, n, produces
a decrease in t_max when the values of t_max are converted
(standardized) to a common scale with the same
maximum for each n. In other words, an increase in
n produces a decrease in t_max/n.

(b) An increase in the sample size, N, produces an increase
in t_max.

(c) An increase in the average criterion-related predictor
variance, π², produces a decrease in t_max.

(d) An increase in the squared multiple correlation, ρ²,
produces an increase in t_max.

These results may be explained in the following way.
When N is small and n is large, the regression weights
are very unstable and the weights for a few principal
components cross-validate better than the weights for
many principal components. However, as the sample size,
N, increases or the number of predictors, n, decreases,
the weights become stable enough that several components,
accounting for more predictor variance, cross-validate
better than a few components.

When π², the average criterion-related predictor
variance, is increased, one linear combination of the
predictors, namely w₁, the first principal predictor,
is increased in variance. The result is that the largest
principal component of the predictors becomes increasingly
collinear with w₁ as π² increases. This means that a
single principal component can produce better prediction
than several components. Hence t_max decreases (becomes
closer to unity) as π² increases.

The effect of ρ² on t_max is not as strong as the effect
of the other three parameters. However the increase
of t_max with increasing ρ² may perhaps be explained as
follows. As ρ² increases, the correlation of each principal
component of the predictors with the criterion increases.
The larger this correlation is, the less likely it is to
vanish on cross-validation. Hence as ρ² increases, more
components have correlations with the criterion which do
not vanish on cross-validation. Hence t_max, which is
approximately the number of such components, increases
as ρ² increases.

The population cross-validity, ρ_c, reflects validation
of weights in the population or an infinitely sized
sample. When the weights for all the previous samples
were calculated, they were also validated on Σ. The
same analysis as previously carried out on r_c² was done
on ρ_c². The values of ρ_c²(max) and t_max are given in
Appendix I. Table 22 summarizes the correlations of
t_max computed from ρ_c² with the parameters. The correlations
are very similar to those obtained for r_c².
The effect of N and n on t_max, the optimal number of
principal components to include in prediction, has been
known as long as reduced rank prediction has been studied.
Studies such as those of Burket (1963) have shown this
phenomenon. The effects of ρ² and π² are new, however,
for only in simulation can these parameters be systematically
varied.
CHAPTER 4

STUDIES  WITH  SEVERAL  CRITERIA 

The generation of populations with several criterion
variables permits the principal predictors to be used in
multiple regression. In the first section of this chapter,
simulation experiments with several criteria are described.
The two reduced rank methods of prediction (multiple regression
on the principal components and on the principal
predictors) are compared. In the next section, some studies
are described using real data from high school students.
Finally, in the last section, an attempt is made to simulate
the real data with the computer program.

4.1.  Simulation  Results 

The purpose of this section is to compare the principal
component and principal predictor methods of prediction
in a number of models which differ in the distribution
of (q_k², k = 1, ..., m) and π². The importance of the
parameter π², the average criterion-related predictor
variance, in prediction from the principal components of
the predictors, has already been shown in the single criterion
case (Section 3.4). When m = 1, there is only one
q_k² and it is directly related to π² (q₁² = n π²). However
when there are several criteria, there are several q_k²,
each q_k² representing the average dependence of the predictors
on the kth principal predictor. The sum of the
q_k² is equal to n π².
Similarly, the (D_kk², k = 1, ..., m) are the average
dependence of the criteria on each principal predictor.
Since the D_kk² are eigenvalues (of E^), they are always
considered in descending order. In this section, the
D_kk² distribution was kept constant. By varying the order
of the q_k² it is possible to vary the relative dependence
of the predictors and criteria on the principal predictors.
This variation will produce differences in the effectiveness
of the two methods of prediction.

Eighteen models were studied. The constant input
parameters are shown in Table 23 and the variable parameters
in Table 24. For each value of π², three different
q_k² distributions were used, called decreasing, level, and
increasing. The same distributions, except for scaling
by π², were used for all three values of π². Two models,
differing only in sample size, were generated for each
combination of π² and q_k² distribution. A pair of samples
of size 20 (small N) was generated from one model and
a pair of samples of size 75 (large N) was generated from
the second model.

In the decreasing q_k² distribution, half the dependence
of the predictors on the principal predictors is
dependence on the first principal predictor. When π² = 0.8,
this means that 40% of the total variance of the predictors
is linearly dependent on the first principal predictor.
The contribution of the succeeding principal predictors
is progressively smaller. When π² = 0.2 and the q_k² are
again decreasing, the first principal predictor accounts

Table 23

Constant Parameters for Section 4.1

n = 10
m = 5
n_s = 10
ρ² = 0.6
D₁₁² = 1.2, D₂₂² = 0.8, D₃₃² = 0.6, D₄₄² = 0.3, D₅₅² = 0.1
v_x = 0.01
e_x = 0.005
v_y = 0.01
e_y = 0.005
number of Σs per model = 1
number of C₁, C₂ pairs per Σ = 1

Table 24

Variable Parameters for Section 4.1

π² = 0.2, 0.5, 0.8
q_k² distribution: decreasing, level, increasing, as shown below
N = 20, 75

          decreasing     level       increasing
q₁²       5.0 π²         2.0 π²      0.625 π²
q₂²       2.5 π²         2.0 π²      0.625 π²
q₃²       1.25 π²        2.0 π²      1.25 π²
q₄²       0.625 π²       2.0 π²      2.5 π²
q₅²       0.625 π²       2.0 π²      5.0 π²
for only 10% of the predictor variance and the other
principal predictors account for less, with a total of 20%
of the variance of the predictors explained by the principal
predictors.

In the level q_k² distribution, each principal predictor
contributes equally to the predictors, the contribution
of each varying from 16% when π² = 0.8 to 4%
when π² = 0.2.

When the (q_k², k = 1, ..., m) are increasing the
dependence is exactly reversed from the decreasing case.
Most of the dependence of the predictors on the principal
predictors is dependence on the fifth (last) principal
predictor. The dependence on the first principal predictor
is very small.
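The percentages quoted for the three distributions follow directly from the multipliers in Table 24 and n = 10 predictors, since each q_k² divided by n is the share of total predictor variance tied to the kth principal predictor. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical.

```python
# Multipliers of pi^2 for q_1^2, ..., q_5^2, taken from Table 24.
MULTIPLIERS = {
    "decreasing": [5.0, 2.5, 1.25, 0.625, 0.625],
    "level": [2.0, 2.0, 2.0, 2.0, 2.0],
    "increasing": [0.625, 0.625, 1.25, 2.5, 5.0],
}

def variance_shares(pi2, kind, n=10):
    """Share of total predictor variance tied to each principal predictor."""
    return [multiplier * pi2 / n for multiplier in MULTIPLIERS[kind]]

for kind in MULTIPLIERS:
    for pi2 in (0.2, 0.5, 0.8):
        shares = variance_shares(pi2, kind)
        # The shares always add up to pi^2, because the q_k^2 sum to n * pi^2.
        print(kind, pi2, [round(s, 4) for s in shares], round(sum(shares), 4))
```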

One change was made in the generation of samples for
the calculations in this chapter. The sample covariance
matrices, C₁ and C₂, were changed to correlation matrices
in order to make possible the calculation of a sample π²
and sample (q_k², k = 1, ..., m).

The calculations performed on the sample correlation
matrices were similar to the calculations in Section 3.4
except that two methods of prediction were compared and
the criteria were no longer duplicate criteria. Multiple
regression on the principal components of the predictors
was done first. The correlation matrix of the predictors,
(C₁)_xx, was diagonalized and the weights for predicting
each of the five criteria from the largest component were
calculated. The cross-validities for each criterion in
the second sample were calculated. The average, over
criteria, of the squares of these validities was calculated
and is presented in Tables 25 to 30 in the "r_c²" columns
under "P. C." for t = 1 (one component). The weights were
then recomputed for the two largest components (t = 2)
and the average squared cross-validity calculated. This
was repeated for t = 3, ..., 10. The whole procedure was
repeated again for validating the weights derived in C₂
on C₁, but these results are not reproduced in the tables
as they are very similar to the validation of C₁ on C₂.

Prediction from the principal predictors was then
performed following the method given in Section 2.7.
The weights in equation (2.7.11) were calculated in the
first sample for each value of t (t = 1, ..., 5) and the
weights were cross-validated in the second sample. The
average of the cross-validities for each value of t was
calculated and is given in Tables 25 to 30 in the "r_w²"
columns under "P. P." for each t. The validation of
sample 2 on sample 1 is not reported here.


Several Criteria (N = 20), q_k² Distribution Decreasing
Consider first N = 75 (Tables 28 to 30). When π² =
0.2, the first few principal predictors are far superior
to the first few principal components in average cross-validity.
This is true whether the q_k² distribution is decreasing,
level, or increasing. The reason is that the
principal predictors account for only 20% of the predictor
variance when π² = 0.2. The largest principal components
therefore reflect variance mostly independent of the
principal predictors and therefore the largest principal
components are not good predictors. The principal predictors,
though of small variance, are good predictors and
multiple regression on them cross-validates well.

The situation changes, however, as π² increases to
0.5 and 0.8. When π² = 0.8 and the q_k² distribution is
decreasing or level (not increasing), there is practically
no difference between the two prediction methods in the
average cross-validity for t = 1, ..., 5. When π² = 0.8,
the five principal predictors account for a total of 80%
of the predictor variance and therefore the principal
components of the predictors are very similar to the
principal predictors. Therefore the principal components
and principal predictors cross-validate equally well.

An exception occurs for the increasing q_k² distribution
when π² = 0.8 (Table 30). Here the principal
predictors cross-validate much better than the first few
principal components. The reason is that the first principal
predictor (largest D_kk² and hence best predictor)
is the smallest principal predictor in terms of associated
predictor variance (q₁² is the smallest q_k²). The principal
predictor method of regression properly picks this first
principal predictor as the best predictor. The principal
component method of prediction however chooses the principal
predictors in reverse order since the principal predictor
with largest predictor variance (50%) is the fifth
principal predictor and the fourth principal predictor has
the next largest variance (20%), etc. Note that there
is a large increase in average cross-validity between t = 4
and t = 5 principal components in this case. The fifth
principal component is approximately collinear with the
first principal predictor which is the best predictor.
On the other hand, when π² = 0.2, the large increase in
average cross-validity using principal components does
not occur until t = 8 or 9.

The results when π² = 0.5 are intermediate between
those for π² = 0.2 and π² = 0.8. Furthermore, the results
for N = 20 (Tables 25 to 27) are similar to the N = 75
results just described except that the effects are not
as clear due to instability of the weights with small N.

An interesting effect is shown in two cases, however
(Table 25, π² = 0.5 and Table 27, π² = 0.2). This effect
is not dependent on sample size and could have occurred
for N = 75. In both cases the average cross-validity when
all factors are included in the regression is essentially
zero since the predictors are dependent. The population
Σ which was generated in these two cases was almost
singular. This was shown by the difficulty in inverting
it. Every time a matrix is inverted in the simulation
program, the result is checked by multiplying the inverse
by the original matrix and comparing the result with
the identity matrix. The largest difference, in absolute
value, between corresponding elements in the two matrices
is printed. Most generated Σ_xx matrices yield a largest
difference of 0.00001 or less. In the two cases mentioned
above, the largest differences were 0.003 and 0.0001,
respectively, indicating approximate dependence of the
predictors in the population.

This dependence appears in the generated samples as well.
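The inversion check described above is easy to reproduce. The sketch below is an assumption-laden illustration in NumPy, not the thesis program: the function name is hypothetical, and double-precision arithmetic will give much smaller differences than the figures quoted in the text.

```python
import numpy as np

def inversion_error(matrix):
    """Largest absolute difference between (inverse @ matrix) and the identity."""
    matrix = np.asarray(matrix, dtype=float)
    product = np.linalg.inv(matrix) @ matrix
    return float(np.max(np.abs(product - np.eye(matrix.shape[0]))))

# A well-conditioned matrix gives a difference near machine precision.
print(inversion_error(np.diag([1.0, 2.0, 4.0])))

# Two nearly identical predictors make the matrix almost singular, and the
# difference typically grows by many orders of magnitude.
nearly_dependent = np.array([[1.0, 1.0 - 1e-10],
                             [1.0 - 1e-10, 1.0]])
print(inversion_error(nearly_dependent))
```

A large value signals that the matrix is close to singular, which is the sense in which the check exposed the dependence of the predictors.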

In the two singular cases, prediction from the principal
components, for t < 10, is successful. However,
for all values of t, the cross-validities using the principal
predictors as predictors are practically zero.
The t largest principal components (t < 10) are independent
and their weights cross-validate well. This is an advantage
of prediction from the principal components of the predictors:
the effect of dependence of the predictors can
be eliminated. However the weights on the principal predictors
are not stable in the singular case, regardless
of the number of principal predictors included in the
regression.

Even though variation of the population parameters
has an appreciable effect on prediction by the two reduced
rank methods, for practical application of these results
it would be necessary to determine from samples what the
population parameters are. Can these parameters be estimated?
The sample (q_k², k = 1, ..., m) and π² are estimates of
the corresponding population parameters and are shown
in Tables 25 to 30. In general the sample q_k² distribution
is similar to the population distribution. This is shown
most clearly for large sample size (N = 75). When the
population q_k² distribution is decreasing, the largest
sample q_k² generally occurs for k = 1. When the q_k² distribution
is level, the sample q_k² are approximately equal.
When the q_k² distribution is increasing, the largest sample
q_k² normally occurs for k = 5. It is therefore possible
to decide, on the basis of the (q_k², k = 1, ..., m) in
the derivation sample, whether the principal predictors
cross-validate better than the principal components or
whether there will be little difference between the two
methods.

As an additional aid in making this determination,
it is important to estimate π². This may be done from
the value of π² computed in the derivation sample. It
can be seen from the tables that the sample π² is a rough
measure of the population value; there is a tendency for
the sample value to shift nearer 0.5 than the population
value.


Since the population (D_kk², k = 1, ..., 5) were not
varied in these simulation studies, the sample D_kk² shown
in the tables are relatively constant from sample to
sample. These parameters will be discussed further in the
next section.


This section has shown that prediction from the
principal predictors is an effective method of prediction,
particularly when the q_k² distribution is increasing or
π² is small. In other cases prediction from the principal
components is almost as successful as prediction
from the principal predictors. The only case in which
prediction from the principal components is superior to
prediction from the principal predictors is when the
predictors are dependent.

4.2.  Study  of  Real  Data 

The calculations described in the preceding section
were also performed on some samples of real data. The
data were collected in 1961 by the Educational Testing
Service, Princeton, N. J., from 1205 boys in academic
high schools. The 21 variables employed in the multiple
regression calculations are listed in Table 31. There
are two sets of eight predictors each and one set of
five criterion variables. The first set of predictors,
called the S-predictors, consists of six variables from
the Sequential Tests of Educational Progress (STEP) and
two variables from the School and College Ability Tests
(SCAT). The second set of predictors, called the T-predictors,
consists of eight variables from the Tests of General
Interest (TGI). The criterion variables are two
variables from the Scholastic Aptitude Test (SAT), two
variables from the College Entrance Examination Board
(CEEB), and the rank in the high school class.

Eight samples were drawn at random from the pool
of 1205 subjects. Four of the samples were of size N = 20

Table  31 

E.  T.  S.  Variables 


S-predictors 

STEP  Mathematics 
STEP  Science 
STEP  Social  Studies 
STEP  Reading 
STEP  Listening 
STEP  Writing 
SCAT  Verbal 
SCAT  Quantitative 


T-predictors 

TGI  Industrial  Arts 
TGI  Home  Arts 
TGI  Physical  Education 
TGI  Biological  Science 
TGI  Music  and  Art 
TGI  History-Literature 
TGI  Entertainment 
TGI  Public  Affairs 


Criteria 

SAT  Verbal 

SAT  Mathematical 

CEEB  English  Composition 

CEEB  American  History 

Rank  in  High  School  Class 
(Samples 1, 2, 3, 4) and the other four samples were of
size N = 75 (Samples 5, 6, 7, 8). Four double cross-validations
were performed (two for each sample size) using
the S-predictors. Then the same samples were used in
four double cross-validations using the T-predictors.
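A double cross-validation derives weights in each sample of a pair and validates them on the other. The sketch below shows the idea for ordinary multiple regression with one criterion, on made-up data; the helper names are hypothetical, and it does not reproduce the principal component or principal predictor calculations applied to the E. T. S. samples.

```python
import numpy as np

def fit_weights(X, y):
    """Least-squares weights (intercept first) derived in one sample."""
    design = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights

def validity(weights, X, y):
    """Correlation of the weighted predictor sum with the criterion."""
    design = np.column_stack([np.ones(len(X)), X])
    return float(np.corrcoef(design @ weights, y)[0, 1])

def double_cross_validate(Xa, ya, Xb, yb):
    """Return (a validated on b, b validated on a)."""
    return (validity(fit_weights(Xa, ya), Xb, yb),
            validity(fit_weights(Xb, yb), Xa, ya))

# Hypothetical pair of samples: eight predictors, N = 75 each.
rng = np.random.default_rng(2)
beta = np.full(8, 0.5)
Xa = rng.standard_normal((75, 8)); ya = Xa @ beta + 0.5 * rng.standard_normal(75)
Xb = rng.standard_normal((75, 8)); yb = Xb @ beta + 0.5 * rng.standard_normal(75)
print(double_cross_validate(Xa, ya, Xb, yb))
```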

Tables 32 (N = 20) and 33 (N = 75) are a summary
of the calculations made on these samples; the calculations
were the same as those made on the simulation samples in
Section 4.1. Correlation matrices were used. Again,
only the validation of C₁ on C₂ is reported.

In  most  of  the  samples,  the  principal  predictors 
(P.  P.)  validate  more  poorly  than  the  principal  components 
(P.  C.)  and  in  the  two  cases  where  the  first  principal 
predictor  validates  better  than  the  first  principal  com¬ 
ponent,  the  improvement  is  not  great.  Another  feature 
of  these  data  is  the  nearly  constant  average  cross-validity 
of  the  principal  predictors  for  t  =  1,  ...,  5.  Even  though, 
in  some  cases,  a  few  principal  components  are  far  superior 
to  including  all  predictors  in  the  regression,  in  no 
case  is  the  first  principal  predictor  significantly  better 
than  all  predictors. 

These findings can be understood by considering the
estimates of the parameters ρ², π², (D_kk², k = 1, ..., 5),
and (q_k², k = 1, ..., 5). In all four derivation samples,
the sample π² is at least 0.87 for the S-predictors and
at least 0.78 for the T-predictors. Therefore these samples
correspond approximately to the π² = 0.8 cases of Section
4.1. Furthermore the sample q_k² distributions are in all


Table 32

E. T. S. Samples (N = 20)

              Sample 1 validated on Sample 2       Sample 3 validated on Sample 4
                             Sample   Sample                      Sample   Sample
t or k        P.C.    P.P.   D_kk²    q_k²         P.C.    P.P.   D_kk²    q_k²

S-predictors
  1           .56     .41    3.15     4.68         .47     .42    3.93     5.83
  2           .55     .42     .40      .83         .47     .41     .28      .34
  3           .56     .46     .24      .59         .42     .39     .11      .35
  4           .53     .44     .11      .41         .44     .36     .03      .43
  5           .52      ?      .02      .46         .45     .37     .01      .26
  6           .52                                  .44
  7           .51                                  .40
  8           .44                                  .37
  Sample ρ²    ?                                    ?
  Sample π²   .87                                   ?

T-predictors
  1           .26     .09    2.16     1.69         .28     .35    3.51     4.31
  2           .27     .06     .27      .68         .31     .32     .25      .42
  3           .27     .08     .16     1.93         .33     .32     .09      .47
  4           .27     .07     .06     1.57         .32     .35     .04      .68
  5           .15     .08     .02      .37         .32     .34     .01      .51
  6           .13                                  .30
  7           .12                                  .32
  8           .08                                  .34
  Sample ρ²   .54                                  .78
  Sample π²   .78                                  .80

Table 33

E. T. S. Samples (N = 75)

              Sample 5 validated on Sample 6       Sample 7 validated on Sample 8
                             Sample   Sample                      Sample   Sample
t or k        P.C.    P.P.   D_kk²    q_k²         P.C.    P.P.   D_kk²    q_k²

S-predictors
  1           .65     .61    3.28     5.55         .64     .?6    3.29     5.??
  2           .66     .63     .13      .57         .66     .69     .09       ?
  3           .66     .63     .07      .40         .67     .69     .06       ?
  4           .64     .63     .02      .27         .67     .69     .03       ?
  5           .64     .63     .01      .24         .67     .69     .01       ?
  6           .64                                  .6?
  7           .63                                  .68
  8           .63                                  .69
  Sample ρ²   .70                                  .70
  Sample π²   .88                                  .90

T-predictors
  1           .49     .37    2.47     3.54         .48     .46    2.09     4.15
  2           .49     .37     .04      .82         .47     .4?     .09      .53
  3           .48     .37     .02      .62         .45     .44     .04      .51
  4           .44     .37     .01      .64         .46     .43     .02      .76
  5           .44     .37     .00      .62         .46     .43     .01      .67
  6           .40                                  .45
  7           .39                                  .44
  8           .37                                  .43
  Sample ρ²   .51                                  .45
  Sample π²   .78                                   ?


cases but one heavily weighted on $q_1^2$, thus indicating an
extremely decreasing $q_k^2$ distribution. Returning to the
corresponding simulation examples in Section 4.1, it is
seen that the E. T. S. results do not differ greatly from
the last columns of Tables 25 (N = 20) and 28 (N = 75).

The failure of the principal predictors in the E. T. S.
samples can be further explained by the sample $D_{kk}^2$
distribution. In all cases, $D_{11}^2$ is at least 80% of the
total predictable variance and therefore the $D_{kk}^2$
distribution in the E. T. S. data is weighted more in favor
of $D_{11}^2$ than the populations considered in Section 4.1.
This means that the first principal component is very
similar to the first principal predictor and hence
prediction from the principal components is effective.
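
The concentration of the $D_{kk}^2$ distribution on its first term can be checked with a short calculation. The sketch below assumes that the total predictable variance is the sum of the five $D_{kk}^2$ values; the function name is illustrative and not from the original study. The first list holds the sample values for the S-predictors as read from Table 33 (Sample 7), and the second is the fixed population distribution that this chapter reports for Section 4.1.

```python
def leading_share(d_squared):
    """Fraction of the summed D_kk^2 values carried by the first one.

    Assumption: the "total predictable variance" is the plain sum of
    the D_kk^2 values.
    """
    total = sum(d_squared)
    if total <= 0:
        raise ValueError("total predictable variance must be positive")
    return d_squared[0] / total


# Sample D_kk^2 for the S-predictors, Sample 7, as read from Table 33.
ets_sample = [3.29, 0.09, 0.06, 0.03, 0.01]
# Fixed population D_kk^2 distribution reported for Section 4.1.
section_4_1 = [1.2, 0.8, 0.6, 0.3, 0.1]

ets_share = leading_share(ets_sample)
sim_share = leading_share(section_4_1)
print(f"E. T. S. sample share of D_11^2: {ets_share:.2f}")
print(f"Section 4.1 share of D_11^2:     {sim_share:.2f}")
```

Under this assumption the E. T. S. sample puts well over 80% of the predictable variance on the first term, while the Section 4.1 population puts only about 40% there.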

In the next section some simulation models will be
considered that more closely match the E. T. S. data,
particularly in the $D_{kk}^2$ distribution.

4.3.  Simulation  of  Real  Data 

The E. T. S. data samples differ from the simulated
data in Section 4.1 in several respects. The
$(D_{kk}^2,\ k = 1, \ldots, 5)$ distribution is much more
concentrated on $D_{11}^2$ in the E. T. S. data than in
Section 4.1, where the population $D_{kk}^2$ distribution
was fixed as (1.2, 0.8, 0.6, 0.3, 0.1). The distribution of
$(q_k^2,\ k = 1, \ldots, 5)$ in the E. T. S. data is similar
to the decreasing $q_k^2$ distribution in Section 4.1
although, in most cases, the E. T. S. data had an even
larger $q_1^2$.

The S-predictors were superior to the T-predictors;
the approximate population $\rho^2$ for the S-predictors might
be estimated to be 0.65 and for the T-predictors about
0.45. Other parameters were roughly estimated for the
two sets of predictors and are shown in Table 34. These
parameters were used to generate population and sample
correlation matrices using the simulation program. Two
populations (First $\Sigma$ and Second $\Sigma$) were generated
for the S-predictor simulation and two populations
(Third $\Sigma$ and Fourth $\Sigma$) were generated for the
T-predictor simulation. One pair of sample correlation
matrices was generated for each population (N = 75 in all
cases). The results of the validation of $C_1$ on $C_2$ for
each of the four populations are shown in Table 35.
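
The estimated parameters can be given a rough internal consistency check. The sketch below rests on two assumptions that are not stated in this section: that the $D_{kk}^2$ values sum to $m\rho^2$ (the total predictable variance of $m$ standardized criteria) and that the $q_k^2$ values sum to $n\pi^2$ (the criterion-related variance of $n$ standardized predictors). The value $n = 8$ is inferred from the eight principal components listed in the tables, and the function name is hypothetical. The inputs are the S-predictor $D_{kk}^2$ values and the T-predictor $q_k^2$ values as read from Table 34.

```python
import math


def sums_match(values, count, average):
    """True when sum(values) equals count * average up to rounding.

    Assumed identities (not stated in the source text):
      sum of D_kk^2 = m * rho^2   and   sum of q_k^2 = n * pi^2.
    """
    return math.isclose(sum(values), count * average, rel_tol=1e-9)


m = 5  # number of criteria
n = 8  # number of predictors (inferred, see lead-in)

# S-predictors: D_kk^2 values and rho^2 as read from Table 34.
s_d_squared = [3.0, 0.1, 0.05, 0.05, 0.05]
s_rho_sq = 0.65

# T-predictors: q_k^2 values and pi^2 as read from Table 34.
t_q_squared = [4.5, 0.5, 0.4, 0.3, 0.3]
t_pi_sq = 0.75

d_consistent = sums_match(s_d_squared, m, s_rho_sq)
q_consistent = sums_match(t_q_squared, n, t_pi_sq)
print("D_kk^2 sum equals m * rho^2:", d_consistent)
print("q_k^2 sum equals n * pi^2: ", q_consistent)
```

Both checks pass for these readings, which lends some support to the assumed identities and to the reading of the table, though it does not establish either.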

If these tables are compared with the corresponding
Table 33 for the real data, it will be seen that the
real and simulation results correspond closely. By adjusting
the parameters it would be possible to make the match
even better, but the simulation using the parameters in
Table 34 is presented here since this simulation was the
first attempted. The close similarity of the simulation
results to the results from the E. T. S. data shows that
the simulation model is basically sound.
Table 34

Estimated Parameters Used to Simulate E. T. S. Data

n = 8
m = 5
$n_s$ = 8
v = 0.01, e = 0.005 (subscript illegible)
$v_y$ = 0.01, $e_y$ = 0.005
N = 75
number of $\Sigma$'s per predictor type = 2
number of $C_1$, $C_2$ pairs per $\Sigma$ = 1

S-predictors (First $\Sigma$ and Second $\Sigma$)
$\rho^2 = 0.65$
$D_{11}^2 = 3.0$, $D_{22}^2 = 0.1$, $D_{33}^2 = 0.05$,
$D_{44}^2 = 0.05$, $D_{55}^2 = 0.05$
$\pi^2 = 0.85$
$q_1^2 = 5.0$, $q_2^2 = 0.6$, $q_3^2 = 0.4$, $q_4^2 = 0.4$, $q_5^2 = 0.4$

T-predictors (Third $\Sigma$ and Fourth $\Sigma$)
$\rho^2 = 0.45$
$D_{11}^2 = 2.0$, $D_{22}^2 = 0.1$, $D_{33}^2 = 0.05$,
$D_{44}^2 = 0.05$, $D_{55}^2 = 0.05$
$\pi^2 = 0.75$
$q_1^2 = 4.5$, $q_2^2 = 0.5$, $q_3^2 = 0.4$, $q_4^2 = 0.3$, $q_5^2 = 0.3$
Table 35

Simulation of E. T. S. Samples (N = 75)

S-predictors

                First $\Sigma$                                    Second $\Sigma$
                Sample A validated on Sample B                    Sample C validated on Sample D
t or k          P.C.        P.P.        Sample      Sample        P.C.        P.P.        Sample      Sample
                $r_c^2$     $r_c^2$     $D_{kk}^2$  $q_k^2$       $r_c^2$     $r_c^2$     $D_{kk}^2$  $q_k^2$

1               .56         .57         3.24        5.05          .59         .57         3.27        5.24
2               .58         .58         .15         [illeg.]      .59         .60         .16         .56
3               .60         .60         .12         [illeg.]      .61         [illeg.]    .11         .32
4               .60         .61         .03         .45           .63         .60         .05         [illeg.]
5               .61         .62         .02         .56           .64         .61         .01         .55
6               .60                                               .61
7               .61                                               .61
8               .62                                               .61
Sample $\rho^2$ .71                                               .72
Sample $\pi^2$  .86                                               [illeg.]

T-predictors

                Third $\Sigma$                                    Fourth $\Sigma$
                Sample E validated on Sample F                    Sample G validated on Sample H
t or k          P.C.        P.P.        Sample      Sample        P.C.        P.P.        Sample      Sample
                $r_c^2$     $r_c^2$     $D_{kk}^2$  $q_k^2$       $r_c^2$     $r_c^2$     $D_{kk}^2$  $q_k^2$

1               .39         .41         2.04        4.21          .51         .54         1.67        4.14
2               .39         .37         .34         .27           .51         .55         .20         .41
3               .40         .36         .19         .51           .52         .53         .13         .86
4               .40         .38         .12         .51           .52         .52         .09         .28
5               .41         .39         .03         .57           .51         .52         .03         .35
6               .42                                               .52
7               .44                                               .53
8               .39                                               .52
Sample $\rho^2$ .54                                               .42
Sample $\pi^2$  .76                                               .75
CHAPTER 5

SUMMARY AND CONCLUSIONS

The  several  methods  of  multiple  regression  discussed 
in  this  study  are  designed  to  provide  optimal  weights 
for  predictor  variables.  The  weights  are  optimal  in  the 
sense  that,  in  new  samples,  the  weighted  linear  combin¬ 
ation  of  the  predictors  has  the  highest  possible  correlation 
with the criterion variable. By means of cross-validation,
it is possible to estimate the correlation in new samples
using only data in a (divided) original sample.
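
A divided-sample cross-validation of this kind can be sketched in a few lines. The code below is a minimal illustration, not the program used in the study: it fits ordinary least-squares weights in a derivation half, applies them to a validation half, and reports the correlation of the weighted sum with the criterion. All function names, the simulated data, and the true weights (2.0 and -1.0) are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def regression_weights(X, y):
    """Least-squares weights (intercept first) from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def weighted_sum(X, w):
    """Apply weights w (intercept first) to each row of predictors."""
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]


def cross_validity(X_deriv, y_deriv, X_valid, y_valid):
    """Derive weights in one half and correlate predictions in the other."""
    w = regression_weights(X_deriv, y_deriv)
    return pearson(weighted_sum(X_valid, w), y_valid)


# Hypothetical data: two predictors, one criterion, 100 cases split in half.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
y = [2.0 * a - 1.0 * b + rng.gauss(0, 0.5) for a, b in X]
r_c = cross_validity(X[:50], y[:50], X[50:], y[50:])
print(f"cross-validity r_c = {r_c:.3f}")
```

The same `cross_validity` call would serve for any method of deriving weights by swapping out `regression_weights`.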

For  any  given  problem  it  is  important  to  decide  which 
prediction method gives the best weights. If one exhaustively
tries all prediction methods it is straightforward,
using cross-validation, to pick the best linear combination
of the predictors. But there are some disadvantages to
this procedure. It is lengthy even with a computer; there
is  capitalization  on  chance  results;  and  the  procedure 
does  not  provide  a  way  to  generalize  to  new  variables  or 
new  populations. 

It  is  apparent  that  no  one  method  of  prediction  will 
be  optimal  for  all  possible  predictor  and  criterion  distri¬ 
butions.  Even  if  one  method,  for  example  prediction  from 
the  principal  components,  were  superior,  it  would  still 
be  necessary  to  decide  the  number  of  components  to  include 
in the regression. Burket's (1964) work included the
computation of statistics which were of some assistance in
deciding how many principal components to include in the
regression. The present study considered some fundamental
parameters  of  the  population  distribution  which  are  relevant 
to  the  choice  of  prediction  method  and  the  number  of 
components  to  include  in  the  regression. 

In  order  to  study  the  effect  of  these  parameters 
on  prediction,  the  distributions  were  simulated  on  a  com¬ 
puter.  The  parameters  were  systematically  varied  and  the 
prediction  methods  were  compared  for  each  parameter  set 
by  applying  the  weights  to  cross-validation  samples. 

In Section 3.3 the accuracy of the sample multiple
correlation and cross-validity as measures of the population
multiple correlation and cross-validity was studied.
The squared sample multiple correlation, $r^2$, is an
overestimate of the squared population multiple correlation,
$\rho^2$. The bias tends to decrease with increasing sample
size and to increase with increasing number of predictors
and increasing $\rho^2$. The bias is correctly estimated by
formulas of Wishart (1931) and Wherry (1931). The sample
and population cross-validities are approximately equal and
underestimate $\rho^2$. The sample cross-validity is therefore
a good estimate of the population cross-validity
but not of the population multiple correlation.
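
As an illustration of a bias correction of this type, the sketch below applies the shrinkage adjustment in the form commonly attributed to Wherry, $1 - (1 - r^2)(N - 1)/(N - n - 1)$. Whether this is the exact form used in the study is an assumption, since the formula itself is not reproduced in this chapter. The function name and the input values (a squared sample multiple correlation of 0.70 with N = 75 and n = 8) are illustrative.

```python
def shrunken_r_squared(r_sq, n_obs, n_pred):
    """Shrinkage-adjusted squared multiple correlation.

    Uses the form commonly attributed to Wherry:
        1 - (1 - r^2) * (N - 1) / (N - n - 1)
    This exact form is an assumption, not taken from the source text.
    """
    if n_obs - n_pred - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_sq) * (n_obs - 1) / (n_obs - n_pred - 1)


# Illustrative inputs: sample r^2 = 0.70, N = 75 cases, n = 8 predictors.
adjusted = shrunken_r_squared(0.70, 75, 8)
print(f"adjusted estimate of rho^2 = {adjusted:.3f}")
```

With these inputs the adjusted value is about 0.66, below the uncorrected 0.70, as expected of a correction for overestimation.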

The dependence of the cross-validation of the principal
components of the predictors on the distribution
parameters was considered in Section 3.4. The technique
used was to calculate the number of principal components
which produce maximum cross-validity. This number, called
$t_{\max}$, was studied as a function of four parameters:
the sample size, N, the number of predictors, n, the squared
population multiple correlation, $\rho^2$, and the average
criterion-related predictor variance, $\pi^2$. It was found
that $t_{\max}$ is an increasing function of N and $\rho^2$
and a decreasing function of n and $\pi^2$. This means that
a few (1 or 2, say) principal components will be more
effective than many components when N is small, n is large,
$\rho^2$ is small, and $\pi^2$ is large.
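
Finding $t_{\max}$ from a column of cross-validities is a simple search. The sketch below uses the squared cross-validities for the principal components of the S-predictors as read from Table 33 (Sample 5 validated on Sample 6). The function name is illustrative, and resolving ties in favor of the smaller number of components is an assumption rather than a rule stated in the study.

```python
def t_max(cross_validities):
    """Number of components (1-based) giving the greatest cross-validity.

    Assumption: ties are resolved in favor of fewer components.
    """
    if not cross_validities:
        raise ValueError("no cross-validities supplied")
    best = max(cross_validities)
    return cross_validities.index(best) + 1


# Squared cross-validities for t = 1, ..., 8 components, as read from
# Table 33 (S-predictors, Sample 5 validated on Sample 6).
pc_cross_validity = [0.65, 0.66, 0.66, 0.64, 0.64, 0.64, 0.63, 0.63]
best_t = t_max(pc_cross_validity)
print("t_max =", best_t)
```

For this column the maximum is reached with two components, in line with the conclusion that a few components suffice when $\pi^2$ is large.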

Many  prediction  problems  in  psychology  involve  multiple 
criteria,  no  one  of  which  can  be  considered  to  be  the 
criterion.  A  convenient  way  to  avoid  choosing  one  criterion, 
and  at  the  same  time,  achieve  some  synthesis  of  the  cri¬ 
teria,  is  to  weight  each  standardized  criterion  equally 
and  to  optimize  the  prediction  of  all  criteria  simultan¬ 
eously.  The  effectiveness  of  any  prediction  method  can 
then  be  estimated  from  the  average  squared  cross-validity. 
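
The average squared cross-validity can be computed as follows. This is a minimal sketch assuming one list of validation-sample predictions and one list of observed scores per criterion, with every criterion weighted equally; the function names and the small data set are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def average_squared_cross_validity(predicted, observed):
    """Mean over criteria of the squared prediction-criterion correlation.

    predicted[j] and observed[j] hold the validation-sample scores for
    criterion j; each criterion counts equally in the average.
    """
    r_sq = [pearson(p, o) ** 2 for p, o in zip(predicted, observed)]
    return sum(r_sq) / len(r_sq)


# Hypothetical validation sample with two criteria: the first is
# predicted perfectly, the second not at all.
predicted = [[1, 2, 3, 4], [1, 2, 3, 4]]
observed = [[2, 4, 6, 8], [1, -1, -1, 1]]
avg = average_squared_cross_validity(predicted, observed)
print(f"average squared cross-validity = {avg:.2f}")
```

Because correlations are unaffected by the scale of each criterion, no separate standardization step is needed in this sketch.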

Two  prediction  methods  were  compared  in  this  way. 

The  first,  prediction  from  the  largest  principal  components 
of  the  predictors,  does  not  use  criterion  information  in 
the  selection  of  the  components  and  may  be  used  for  one 
or  several  criteria.  The  second,  prediction  from  the 
principal  predictors,  uses  criterion  information  to  calcul¬ 
ate  the  principal  predictors  themselves.  This  method 
optimizes  the  average  squared  multiple  correlation  in 
the derivation sample.

It was found, for the distributions studied, that
the principal predictors had superior or equal
cross-validities to the principal components except when the
predictors were approximately dependent. The superiority
of principal predictors was particularly evident when
$\pi^2$ was small and the $q_k^2$ distribution was increasing,
meaning that the first principal predictor accounted for
much less of the predictor variance than the last principal
predictor. However, this combination of parameters
($\pi^2$ small and the $q_k^2$ distribution increasing) may
occur rarely, if at all, in real multivariate distributions.
In the sample of real ability and interest data from the
Educational Testing Service, $\pi^2$ was very large and the
$q_k^2$ distribution was decreasing with heavy concentration
on $q_1^2$. In these data, as in the corresponding simulation
data, the principal components were superior to the
principal predictors.

This result is similar to Burket's (1964) finding
that  the  principal  components  correlating  greatest  with 
the  criterion  do  not  validate  as  well  as  the  largest  prin¬ 
cipal  components.  It  appears  to  be  an  advantage  to  select 
linear  combinations  of  the  predictors  independently  of 
criterion  information  in  order  to  maximize  cross-validity. 

In  order  for  the  conclusions  of  a  simulation  study 
to  apply  to  real  prediction  situations,  it  must  be  shown 
that  the  simulated  distributions  are  similar  to  real  dis¬ 
tributions  in  relevant  characteristics.  Several  sections 
of  this  study  were  concerned  with  this  demonstration. 

In Section 3.1 it was shown that, when the population
multiple correlation is zero, the simulation sample statistics
(multiple correlation and cross-validity) obey known
statistical laws. In Section 3.2 it was shown that changes
in the covariance matrix of the predictors, keeping the
multiple correlation constant, have no effect on the
correlation statistics. Finally, in Section 4.3, several
models and samples were generated in order to match the
E. T. S. data more closely. The results from this
simulation were almost identical to the E. T. S. results.

It would be interesting to determine if other real
data have different values of $\pi^2$ and different $q_k^2$
distributions than the E. T. S. data and to see if calculations
using these variables obey the laws discovered in
simulation. It is also necessary to extend the calculations
to larger numbers of predictors and criteria. Such work
would be a further check on the effectiveness of the
simulation model which was used in this study.